scikit-learn: A walk through of GroupKFold.split()

January 25, 2018 · 2 min read

Suppose $X[\text{"groups"}] = [\begin{matrix} a \\ b \\ b \\ c \\ c \\ c \end{matrix}]$ and n_splits=3.

Then GroupKFold.split(X, y, X["groups"]) calls the _iter_test_indices method, which yields the indices of each test fold in turn.

# Parameter groups == X["groups"]
unique_groups, groups = np.unique(groups, return_inverse=True)

$$\begin{aligned} \text{unique\_groups} & = [\begin{array}{c} a \\ b \\ c \end{array}] \\ \text{groups} & = [\begin{array}{c} 0 \\ 1 \\ 1 \\ 2 \\ 2 \\ 2 \end{array}] \end{aligned}$$
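The call above can be reproduced on its own. This is a minimal sketch using the toy labels from this post; the variable name `raw_groups` is introduced here only for illustration.

```python
import numpy as np

# Toy group labels from the example (the name raw_groups is illustrative)
raw_groups = np.array(["a", "b", "b", "c", "c", "c"])

# return_inverse=True also returns, for every sample, the position of its
# label inside the sorted array of unique labels
unique_groups, groups = np.unique(raw_groups, return_inverse=True)

print(unique_groups)  # ['a' 'b' 'c']
print(groups)         # [0 1 1 2 2 2]
```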

This groups array is a useful index: if X["groups"] has $n$ unique values, indexing any array of $n$ markers with groups assigns one marker to each entry of the original X["groups"]. For example:

markers = np.array(['△', '○', '□'])
markers[[0, 1, 1, 2, 2, 2]]  # array(['△', '○', '○', '□', '□', '□'], dtype='<U1')

$$\text{markers}[\text{groups}] = [\begin{matrix} △ \to a \\ ○ \to b \\ ○ \to b \\ □ \to c \\ □ \to c \\ □ \to c \end{matrix}]$$

In particular, unique_groups[groups] reproduces X["groups"] element by element.
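Both uses of the index can be checked in a few lines. This is a small sketch on the same toy data; `raw_groups`, `relabelled` and `restored` are names made up for this example.

```python
import numpy as np

raw_groups = np.array(["a", "b", "b", "c", "c", "c"])
unique_groups, groups = np.unique(raw_groups, return_inverse=True)

# Indexing a length-3 lookup table with `groups` relabels every sample
markers = np.array(["△", "○", "□"])
relabelled = markers[groups]
print(relabelled)  # ['△' '○' '○' '□' '□' '□']

# Using unique_groups itself as the lookup table restores the original labels
restored = unique_groups[groups]
assert np.array_equal(restored, raw_groups)
```

The same lookup-table trick is what maps groups to folds later in the walk-through.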

n_groups = len(unique_groups)  # 3
 
# Weight groups by their number of occurrences
n_samples_per_group = np.bincount(groups)

$$\text{n\_samples\_per\_group} = [\begin{matrix} 1 \\ 2 \\ 3 \end{matrix}]$$

# Distribute the most frequent groups first
indices = np.argsort(n_samples_per_group)[::-1]
n_samples_per_group = n_samples_per_group[indices]

$$\begin{aligned} \text{indices} & = [\begin{array}{c} 2 \\ 1 \\ 0 \end{array}] \\ \text{n\_samples\_per\_group} & = [\begin{array}{c} 3 \\ 2 \\ 1 \end{array}] \end{aligned}$$
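The counting and sorting steps can be run in isolation. The sketch below starts from the inverse indices computed earlier; `sorted_counts` is a name chosen here to avoid overwriting `n_samples_per_group`.

```python
import numpy as np

# Inverse indices of the toy example: sample i belongs to group groups[i]
groups = np.array([0, 1, 1, 2, 2, 2])

# bincount counts how many times each group index occurs
n_samples_per_group = np.bincount(groups)

# argsort is ascending, so reversing it puts the largest group first
indices = np.argsort(n_samples_per_group)[::-1]
sorted_counts = n_samples_per_group[indices]

print(n_samples_per_group)  # [1 2 3]
print(indices)              # [2 1 0]
print(sorted_counts)        # [3 2 1]
```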

# Total weight of each fold
n_samples_per_fold = np.zeros(self.n_splits)  # [0, 0, 0]

# Mapping from group index to fold index
group_to_fold = np.zeros(len(unique_groups))  # [0, 0, 0]

# Distribute samples by adding the largest weight to the lightest fold
for group_index, weight in enumerate(n_samples_per_group):
    lightest_fold = np.argmin(n_samples_per_fold)
    n_samples_per_fold[lightest_fold] += weight
    group_to_fold[indices[group_index]] = lightest_fold

group_index = 0; weight = 3
- lightest_fold = 0
- n_samples_per_fold[0] = 3
- group_to_fold[2] = 0
group_index = 1; weight = 2
- lightest_fold = 1
- n_samples_per_fold[1] = 2
- group_to_fold[1] = 1
group_index = 2; weight = 1
- lightest_fold = 2
- n_samples_per_fold[2] = 1
- group_to_fold[0] = 2

$$\text{group\_to\_fold} = [\begin{matrix} 2 \\ 1 \\ 0 \end{matrix}]$$
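The greedy loop traced above can be wrapped in a standalone function to try other group sizes. This is a sketch rather than the library's code: `assign_groups_to_folds` is a hypothetical helper, and it uses integer arrays where the original uses float zeros.

```python
import numpy as np

def assign_groups_to_folds(group_sizes, n_splits):
    """Greedy assignment: the biggest remaining group goes to the lightest fold."""
    group_sizes = np.asarray(group_sizes)
    order = np.argsort(group_sizes)[::-1]          # largest group first
    fold_sizes = np.zeros(n_splits, dtype=int)     # total samples per fold
    group_to_fold = np.zeros(len(group_sizes), dtype=int)
    for group in order:
        lightest = np.argmin(fold_sizes)           # first fold with the fewest samples
        fold_sizes[lightest] += group_sizes[group]
        group_to_fold[group] = lightest
    return group_to_fold, fold_sizes

# The example from this post: group sizes 1, 2, 3 and three folds
group_to_fold, fold_sizes = assign_groups_to_folds([1, 2, 3], n_splits=3)
print(group_to_fold)  # [2 1 0]
print(fold_sizes)     # [3 2 1]

# More groups than folds: the greedy rule keeps the two folds close in size
balanced_map, balanced_sizes = assign_groups_to_folds([5, 4, 3, 2, 1], n_splits=2)
print(balanced_map)    # [0 1 1 0 0]
print(balanced_sizes)  # [8 7]
```

The second call shows why the largest groups are placed first: the small groups that come last are what even out the fold sizes.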

indices = group_to_fold[groups]

Key step! Here group_to_fold plays the role of the marker array: indexing it with groups maps every sample to the fold of its group.

$$\text{indices} = \text{group\_to\_fold}[\text{groups}] = [\begin{matrix} 2 \to a \\ 1 \to b \\ 1 \to b \\ 0 \to c \\ 0 \to c \\ 0 \to c \end{matrix}]$$

for f in range(self.n_splits):
    yield np.where(indices == f)[0]  # `np.where` returns a one-element tuple here, hence the [0]

The 1st split: f = 0, yield np.array([3, 4, 5])
The 2nd split: f = 1, yield np.array([1, 2])
The 3rd split: f = 2, yield np.array([0])
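The last two steps, mapping samples to folds and collecting the test indices, can be reproduced from the arrays derived so far. This is a sketch with the values hard-coded; `fold_of_sample` and `test_folds` are illustrative names.

```python
import numpy as np

groups = np.array([0, 1, 1, 2, 2, 2])   # group index of every sample
group_to_fold = np.array([2, 1, 0])     # fold assigned to every group

# Lookup-table indexing: fold of each sample
fold_of_sample = group_to_fold[groups]
print(fold_of_sample)  # [2 1 1 0 0 0]

# np.where on a 1-D condition returns a tuple with one index array
test_folds = [np.where(fold_of_sample == f)[0] for f in range(3)]
for f, test_index in enumerate(test_folds):
    print(f, test_index)
# 0 [3 4 5]
# 1 [1 2]
# 2 [0]
```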

# This is an abstract class, `_iter_test_indices` being the abstract method
class BaseCrossValidator(with_metaclass(ABCMeta)):
    def split(self, X, y=None, groups=None):
        X, y, groups = indexable(X, y, groups)
        indices = np.arange(_num_samples(X))  # array([0, 1, 2, 3, 4, 5]) here
        for test_index in self._iter_test_masks(X, y, groups):
            train_index = indices[np.logical_not(test_index)]
            test_index = indices[test_index]
            yield train_index, test_index

    def _iter_test_masks(self, X=None, y=None, groups=None):
        """Generates boolean masks corresponding to test sets.
        By default, delegates to _iter_test_indices(X, y, groups)
        """
        for test_index in self._iter_test_indices(X, y, groups):
            test_mask = np.zeros(_num_samples(X), dtype=bool)  # plain `bool`; the `np.bool` alias is missing in some NumPy releases
            test_mask[test_index] = True
            yield test_mask

    def _iter_test_indices(self, X=None, y=None, groups=None):
        """Generates integer indices corresponding to test sets."""
        raise NotImplementedError

The 1st split:
- test_mask == np.array([False, False, False, True, True, True])
- train_index == np.array([0, 1, 2])
- test_index == np.array([3, 4, 5])
The 2nd split:
- test_mask == np.array([False, True, True, False, False, False])
- train_index == np.array([0, 3, 4, 5])
- test_index == np.array([1, 2])
The 3rd split:
- test_mask == np.array([True, False, False, False, False, False])
- train_index == np.array([1, 2, 3, 4, 5])
- test_index == np.array([0])
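The mask logic in split() can be imitated without the class machinery. This is a sketch under the assumption that the test folds are already known; `split_from_test_indices` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def split_from_test_indices(n_samples, test_folds):
    """Yield (train_index, test_index) pairs from lists of test indices."""
    indices = np.arange(n_samples)
    for test_index in test_folds:
        # Boolean mask that is True exactly at the test positions
        test_mask = np.zeros(n_samples, dtype=bool)
        test_mask[test_index] = True
        # The complement of the mask selects the training samples
        yield indices[~test_mask], indices[test_mask]

splits = list(split_from_test_indices(6, [[3, 4, 5], [1, 2], [0]]))
for train_index, test_index in splits:
    print(train_index, test_index)
# [0 1 2] [3 4 5]
# [0 3 4 5] [1 2]
# [1 2 3 4 5] [0]
```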

P.S. Given the same input, GroupKFold’s output is deterministic, so no random seed is needed.
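A quick way to check this is to split twice and compare the results. The sketch below assumes scikit-learn is installed and that GroupKFold is built with its default arguments; the feature matrix `X` is a placeholder of zeros, since only the group labels matter here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((6, 1))  # placeholder features; only the groups drive the split
groups = np.array(["a", "b", "b", "c", "c", "c"])

def test_folds():
    # A fresh splitter every time, constructed without any seed
    cv = GroupKFold(n_splits=3)
    return [test.tolist() for _, test in cv.split(X, groups=groups)]

first, second = test_folds(), test_folds()
print(first)            # [[3, 4, 5], [1, 2], [0]]
print(first == second)  # True
```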
