In the previous sections (t-tests), we discussed testing hypotheses about the mean of one or two sets of data, assuming the data were measured on a numerical scale and drawn from normal distributions. When data are categorical, we instead use the methods discussed in the section on proportions.

The techniques from the t-tests section remain useful for large samples, thanks to the central limit theorem: with large enough samples, sample means tend to follow a normal distribution, provided certain conditions are met.

But what if our data don’t come from a normal distribution? We can often still use the methods of the t-tests section: we never assume that the sample itself is normally distributed, only the population it comes from. When the underlying distribution isn’t normal, a common trick is to transform the data. For instance, we might take logarithms of the data points to make a skewed distribution more symmetrical.

When data are on a numerical scale but don’t follow a normal distribution, and transforming the data to achieve normality isn’t feasible or desirable, we need tests that don’t rely on the assumption of normally distributed data, or we risk misleading results. Conover (1971) notes that parametric tests can still be approximately valid when the data distribution only roughly resembles a normal distribution: the p-value might deviate slightly, but typically not enough to change the conclusions drawn. Using parametric tests when the underlying assumptions aren’t even approximately valid, however, is risky: the hypothesis test then becomes sensitive not only to false hypotheses but also to false assumptions. Severe skewness in the data exacerbates this problem, making it imperative to choose appropriate non-parametric tests that accommodate the data’s distribution without compromising the integrity of the analysis.

Non-parametric tests are also essential when the data are ranks rather than numerical values, for instance in surveys where respondents rank cars by driveability, prioritize shares by profitability, evaluate salary packages by desirability, or rank habitats by suitability. Because these tests use only the order of the data points rather than their specific values, they are the natural choice for ordinal data, where the assumptions behind interval- or ratio-scale parametric tests do not apply. When faced with ranked data, non-parametric tests are therefore the preferred approach for meaningful and reliable statistical analysis.

Non-parametric tests, often referred to as distribution-free tests, offer an alternative approach to traditional parametric tests by relaxing some of their stringent assumptions. However, it’s important to note that while non-parametric tests have weaker assumptions compared to their parametric counterparts, they are not entirely assumption-free. For instance, they may still require similar parent distributions under the null hypothesis, particularly in cases involving two groups. Despite this, non-parametric tests provide valuable tools for analyzing data in situations where parametric assumptions are violated or when dealing with non-standard data types.

The realm of non-parametric tests encompasses a wide array of statistical methods, too numerous to cover comprehensively in this discussion. Only the most widely recognized tests will be highlighted here, but for a more exhaustive exploration of available tests, interested readers can refer to sources like Conover (1971) for additional options. By offering flexibility in statistical analysis and accommodating diverse data distributions and structures, non-parametric tests play a vital role in empirical research and data-driven decision-making across various fields.

The definition of non-parametric tests can vary across different literature sources, leading to some ambiguity. While some texts categorize tests for proportions and categorical data under the umbrella of non-parametric tests, others adopt a narrower definition. For instance, Conover (1971) delineates non-parametric tests as applicable to data measured on a nominal or ordinal scale, or on an interval scale where the distribution function cannot be specified, meaning we cannot assume the data follows a specific distribution such as the normal distribution.

In this discussion, we’ll consider non-parametric tests as those suitable for ranked data or situations where the assumption of a normal distribution cannot be made, particularly when the underlying distribution is skewed. Many non-parametric tests rely on ranking individual data points, resulting in distributions that exhibit greater symmetry compared to the original data. By accommodating diverse data types and distributions, non-parametric tests offer valuable tools for statistical analysis, particularly in scenarios where parametric assumptions are untenable or inappropriate.

Sign Test

As before, we start with the simplest situation, that of testing whether a single sample is compatible with a pre-specified value. This scenario is common in various real-world situations, such as comparing specific unit trust returns to market averages or assessing whether employees are neutral toward the implementation of affirmative action campaigns. The Sign Test emerges as a powerful tool in such cases, allowing researchers to evaluate the compatibility of a sample’s median with a hypothesized value without relying on distributional assumptions.

Basic Procedure:

  1. Null Hypothesis: The null hypothesis (H₀) posits that the median (η) of the population from which the sample is drawn equals a pre-specified value (η₀). Examples include hypothesizing that a specific stretching exercise is not effective in preventing sports injuries or that respondents are neutral toward receiving ads via the internet.

  2. Calculating Differences: Subtract the pre-specified median (η₀) from each observation in the sample, counting positive and negative differences. For instance, in assessing whether lizards’ body temperatures match ambient temperatures, one would compare individual lizard temperatures to the ambient temperature.

  3. Test Statistic: The test statistic (T) is the smaller of the counts of positive and negative differences. This count represents deviations of sample observations from the pre-specified median.

  4. Hypothesis Testing: Compare the observed value of T to critical values from the Sign Test table or calculate the p-value. If the observed value falls within the critical region or if the p-value is less than the chosen significance level, reject H₀; otherwise, fail to reject it. (A computational sketch follows this list.)
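To make the procedure concrete, here is a minimal sketch in Python, assuming SciPy (≥ 1.7, for scipy.stats.binomtest) is available; the sample values and the hypothesized median η₀ are hypothetical:

```python
# Minimal sign-test sketch; the sample and hypothesized median are hypothetical.
from scipy.stats import binomtest

observations = [12.1, 9.8, 11.4, 10.7, 13.2, 9.5, 12.8, 11.9]  # hypothetical data
eta0 = 10.0                                                    # hypothesized median

# Differences from the hypothesized median; zero differences are discarded.
diffs = [x - eta0 for x in observations if x != eta0]
n_pos = sum(d > 0 for d in diffs)
n = len(diffs)

# Under H0, the count of positive differences is Binomial(n, 1/2).
result = binomtest(n_pos, n, p=0.5, alternative="two-sided")
print(f"{n_pos} positive out of {n}; two-sided p = {result.pvalue:.4f}")
```

The binomial distribution drives the test because, under H₀, each non-zero difference is equally likely to be positive or negative.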

Interpretation:

Rejecting the null hypothesis indicates that the sample’s median significantly differs from the pre-specified value, providing evidence for the research hypothesis. Conversely, failing to reject the null hypothesis suggests insufficient evidence for such a difference, warranting further investigation.

The Sign Test can also be extended to assess trends by splitting the data series into halves and examining pairs of observations. This method, known as the Cox and Stuart test for trend, offers insights into directional changes over time.

The Sign Test is thus a versatile and robust tool for assessing the compatibility of a sample’s median with a pre-specified value. Because it makes no distributional assumptions, it allows meaningful conclusions in a wide range of real-world settings, from unit trust returns to employee attitudes and environmental temperatures, and it supports both hypothesis testing and trend analysis.

Wilcoxon Signed Rank Test

Building upon the principles of the sign test, the Wilcoxon signed rank test offers a more sophisticated approach for assessing the compatibility of a sample’s median with a specified value. In this test, researchers aim to determine whether the population from which the sample is drawn has a median equal to a hypothesized value η₀.

Basic Procedure:

  1. Calculating Differences: Begin by calculating the differences between each observation and the hypothesized median η₀. These differences, denoted Dᵢ = Xᵢ − η₀, represent the deviations of individual observations from the hypothesized median.

  2. Ranking Differences: Disregard the signs of the differences for now and rank the absolute values of these differences from smallest to largest. Exclude differences that are zero (i.e., when Xᵢ = η₀) from ranking.

  3. Incorporating Signs: Once ranked, reintroduce the signs of the differences onto the ranks. Assign ranks based on the absolute values of the differences while preserving the signs.

  4. Calculating the Test Statistic: The test statistic, W⁺, is obtained by summing the ranks corresponding to positive differences. This sum is the Wilcoxon signed rank statistic. (A computational sketch follows this list.)
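Steps 1–4 can be expressed in a few lines of Python. This is a minimal sketch assuming NumPy and SciPy are available; the sample and hypothesized median are hypothetical:

```python
# Sketch of the Wilcoxon signed rank statistic W+; data are hypothetical.
import numpy as np
from scipy.stats import rankdata

x = np.array([5.1, 4.2, 6.8, 5.9, 4.4, 7.3, 5.5, 6.1])  # hypothetical sample
eta0 = 5.0                                               # hypothesized median

d = x - eta0
d = d[d != 0]                    # step 2: zero differences are excluded
ranks = rankdata(np.abs(d))      # rank absolute differences; ties get mid-ranks
w_plus = ranks[d > 0].sum()      # step 4: sum of ranks of positive differences
print("W+ =", w_plus)
```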

Interpretation:

Unlike the sign test, which solely considers the number of observations above and below the median, the Wilcoxon signed rank test also accounts for the distances of observations from the median. Thus, it evaluates whether the distribution around the median is balanced, not only in terms of the number of observations but also in terms of their proximity to the median.

Handling Ties:

In cases where there are ties (i.e., observations with the same value), each tied observation is assigned the average rank. For example, if two observations share the same value, they are each given the average of the ranks they would have received if ranked separately.

Extension to Paired Observations:

The Wilcoxon signed rank test can also be applied to paired or related observations by considering the differences between paired observations. This extension allows researchers to assess differences between two related samples while accommodating non-parametric assumptions.

The Wilcoxon signed rank test offers a robust and versatile approach for assessing the compatibility of a sample’s median with a specified value. By considering both the number of observations above and below the median and their distances from the median, this test provides a more comprehensive evaluation of the data distribution. Whether applied to single samples or paired observations, the Wilcoxon signed rank test remains a valuable tool in statistical analysis, particularly in scenarios where parametric assumptions are untenable or impractical.

Continuing with Example 2.3, let’s rephrase it in the context of sales data before and after an advertising campaign:


Consider a small-scale study analyzing sales data before and after an advertising campaign for a new product. In this scenario, researchers are interested in evaluating whether the campaign led to a significant change in sales. Given the limited sample size, the assumption of normal distribution for sales data becomes crucial. However, if one is hesitant to make this assumption, resorting to a non-parametric test becomes necessary.

Here’s a hypothetical dataset representing sales figures before and after the campaign:

Observation   Sales Before   Sales After   Difference   Rank   Sign
1             120            150           30           5      +
2             90             110           20           4      +
3             110            105           -5           1      -
4             130            140           10           2.5    +
5             95             105           10           2.5    +

In this table, each observation represents sales data for a specific period before and after the advertising campaign. The “Difference” column indicates the change in sales, and the “Rank” column gives the rank of each absolute difference (the two tied differences of 10 each receive the average rank of 2.5). The “Sign” column indicates whether the difference is positive (+) or negative (-).

To conduct the Wilcoxon signed rank test, we calculate the sum of positive ranks, which in this case is 14 (as there is only one negative observation). The null hypothesis (H₀) states that there is no difference in sales before and after the campaign, while the alternative hypothesis (H₁) suggests otherwise.

By referring to statistical tables or software tools, such as those provided by Conover or Brown and Hollander, researchers can determine the probability associated with the test statistic. The probability of observing a test statistic greater than or equal to 14 is found to be 0.062. Doubling this probability to account for the two-sided test yields 0.124, indicating that the null hypothesis is not rejected at the 5% or 10% significance levels.

Consequently, the analysis suggests that there is no significant change in sales before and after the advertising campaign. This outcome implies that the assumption of normality for the sales data was reasonable, reinforcing the validity of the findings.

Another Example:

Consider a study involving ten individuals diagnosed with a specific medical condition, who were matched with ten control individuals based on various factors such as age, sex, and social class. The researchers aimed to investigate the total time each participant had used a particular medication. Due to the matched nature of the samples, they are treated as paired rather than independent samples. The objective is to determine if there is a difference in medication usage between the two groups, thus requiring a two-sided test.

Here are the observed data:

Pair   Case   Control   Difference   Rank   Signed Rank
1      2.0    1.5       0.5          2      2
2      10.0   9.1       0.9          4.5    4.5
3      7.1    8.1       -1.0         6      -6
4      2.3    1.5       0.8          3      3
5      3.0    3.1       -0.1         1      -1
6      4.1    5.2       -1.1         7      -7
7      10.0   1.0       9.0          10     10
8      10.5   9.6       0.9          4.5    4.5
9      12.1   7.6       4.5          8      8
10     15.0   9.0       6.0          9      9

In this table, each pair represents a matched case-control pair, and the “Difference” column indicates the difference in medication usage between the case and control individuals. After ranking these differences and assigning signed ranks, we calculate the sum of positive ranks, which equals 41.

The p-value obtained from statistical tables is approximately 0.194, indicating that the null hypothesis, which suggests no difference in medication usage between the two groups, cannot be rejected at the 5% or 10% significance levels.

Alternatively, applying a t-test for paired samples yields a test statistic value of 1.93, resulting in a significance of 0.0856. However, it’s important to note that this significance is likely influenced by the notable differences in medication usage observed in pair number 7 (a difference of 9 units) and pair number 10 (a difference of 6 units), which skew the distribution.

Given the small sample size, the t-test may not adequately account for the skewed distribution caused by these outliers, leading to an overestimation of significance. Therefore, the findings from the Wilcoxon signed rank test are considered more reliable in this context.

When you’re unsure whether the assumptions of a parametric test are suitable, it’s a good idea to double-check the results with a non-parametric test. If the conclusions from the two tests don’t match, it’s important to consider why. Keep in mind that non-parametric tests are generally less powerful than parametric ones when the data really do follow a normal distribution, which means they could miss detecting actual differences.

One way to assess the distribution of your data is by creating a histogram. If the data appears very skewed, a non-parametric test is likely the better choice. However, if the skewness is mild, the parametric test is probably providing accurate results. Another method is to examine the coefficients of skewness and kurtosis. For a normal distribution, these coefficients are zero and three, although some software packages redefine kurtosis so that it’s zero for a normal distribution. While some packages offer tests for normality, they’re not always useful for small sample sizes.

Tables for statistical tests are typically available for small samples, defined as 15 or fewer observations. For larger samples, it’s known that the test statistic’s distribution is quite similar to the normal distribution. Hence, a “large sample approximation” can be used for the test statistic, based on the standard normal distribution. To apply this approximation, one calculates a standardized test statistic and compares it to the standard normal distribution tables. This method can be applied even to small samples, as many software programs compute p-values using this large sample approximation.

Example (continued):

After calculating the sum of positive ranks as W⁺ = 41, we compute the approximately normally distributed test statistic:

z = (W⁺ − n(n + 1)/4) / √(n(n + 1)(2n + 1)/24) = (41 − 27.5) / √96.25 ≈ 1.38

Since we’re using a large sample approximation, we treat this statistic as standard normally distributed. The two-sided probability associated with this test statistic is approximately 0.17.

Comparing this probability to the one obtained using exact tables, which was 0.194, we observe that the large sample approximation holds up reasonably well even with only 10 cases in this example.
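These figures can be reproduced with a short script (a sketch assuming NumPy and SciPy are available), using the differences from the matched-pairs table above:

```python
# Large-sample approximation for the matched-pairs example above.
import numpy as np
from scipy.stats import norm, rankdata

d = np.array([0.5, 0.9, -1.0, 0.8, -0.1, -1.1, 9.0, 0.9, 4.5, 6.0])
n = len(d)

ranks = rankdata(np.abs(d))           # mid-ranks handle the tied 0.9s
w_plus = ranks[d > 0].sum()           # 41, as computed by hand

mean = n * (n + 1) / 4                # 27.5
sd = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (w_plus - mean) / sd              # about 1.38
p = 2 * norm.sf(z)                    # two-sided p-value, about 0.17
print(w_plus, round(z, 2), round(p, 3))
```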

Mann-Whitney U Test and Wilcoxon Rank Sum Test

In statistical analysis, when we want to compare two independent samples to determine if they come from the same underlying population, we often turn to non-parametric tests like the Mann-Whitney U test and the Wilcoxon rank sum test. These tests are especially useful when the assumptions of parametric tests cannot be met, such as when the data is not normally distributed or when it contains outliers.

Principle and Assumptions

Both tests are based on the principle of comparing the ranks of observations between the two samples. If the two samples come from the same population, their ranks should be distributed similarly, with roughly equal overlap between the two distributions when superimposed. The tests do not assume any specific distribution for the data, making them robust and applicable in various scenarios.

Mann-Whitney U Test

The Mann-Whitney U test evaluates whether two independent samples have the same distribution. It works by ranking all observations from both samples together, then calculating the sum of ranks for each sample. From a sample’s rank sum R and its size n, one computes U = R − n(n + 1)/2; the test statistic is the smaller of the two U values. By comparing the calculated U value to critical values from statistical tables, we can determine the significance of the difference between the two samples.
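As a minimal sketch (hypothetical data, assuming SciPy is available), scipy.stats.mannwhitneyu carries out the ranking and the comparison in one call:

```python
# Mann-Whitney U test on two hypothetical independent samples.
from scipy.stats import mannwhitneyu

group_a = [12, 15, 9, 20, 17, 14]
group_b = [22, 25, 19, 30, 28, 24, 21]

res = mannwhitneyu(group_a, group_b, alternative="two-sided")
print("U =", res.statistic, "p =", res.pvalue)
```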

Wilcoxon Rank Sum Test

The Wilcoxon rank sum test, also known as the Wilcoxon-Mann-Whitney test, is another non-parametric test for comparing two independent samples. Like the Mann-Whitney U test, it ranks all observations from both samples together. However, instead of using the smaller sum of ranks as the test statistic, the Wilcoxon rank sum test uses the sum of ranks from one of the samples. This choice reduces computational complexity, especially in the days when manual calculation was common.

Handling Tied Ranks

In cases where tied ranks occur, adjustments may be needed to the test statistic to ensure accuracy. However, most modern statistical software automatically accounts for tied ranks and adjusts the test statistic accordingly.

The Mann-Whitney U test and the Wilcoxon rank sum test are valuable tools for comparing two independent samples when the assumptions of parametric tests cannot be met. They provide robust and reliable results, making them widely used in various fields of research and analysis.

Example 6.2:

Suppose we have measurements of the thickness of the kidney wall in ten-year-olds with bacterial kidney infection compared to those without the infection.

The same analysis applies if we rephrase the problem in another context, for example comparing the number of shares issued in different months before and after a slump in the stock exchange.

  • Diseased: 10, 13, 14, 15, 22
  • Non-diseased: 18, 22, 23, 24, 25, 27, 27, 31, 34

We rank the observations:

  • Diseased: 1, 2, 3, 4, 6.5
  • Non-diseased: 5, 6.5, 8, 9, 10, 11.5, 11.5, 13, 14

Null and Alternative Hypotheses:

  • H₀: There is no difference in the thickness of the kidney wall between ten-year-olds with and without the disease.
  • H₁: The disease reduces the thickness.

The sum of ranks for the smaller sample is 16.5. To assess the significance of this difference, we consult statistical tables. Under the null hypothesis, the probability that the statistic is less than or equal to 17 is found to be 0.002. Thus, we reject the null hypothesis at the 5% (and 1%) levels.

Since tables are only available for small samples (up to 10), we can use a large sample approximation based on the standard normal distribution. The formula for the statistic is:

z = (W − n₁(n₁ + n₂ + 1)/2) / √(n₁n₂(n₁ + n₂ + 1)/12)

Here, W is the original Wilcoxon rank sum statistic (the sum of ranks of the smaller sample), n₁ is the size of the smaller sample, and n₂ is the size of the larger sample. We calculate the mean and standard deviation from the sample sizes and compare the resulting z statistic to standard normal tables for significance assessment.

Example 6.3:

Suppose we have measurements of a specific urinary component in subjects with and without carcinoid heart disease.

The same analysis applies to rephrased settings: comparing the returns on investments for industrial shares listed on the JSE with those for financial shares, the prices paid for a cup of coffee in a number of shops in two different cities, or the total number of hours spent studying by BSc and BA students.

  • Diseased: 263, 288, 432, 890, 450, 1270, 220, 350, 283, 274, 580, 285, 524, 135, 500, 120
  • Nondiseased: 60, 119, 153, 588, 124, 196, 14, 23, 43, 854, 400, 73

We rank the observations:

  • Diseased: 13, 17, 20, 27, 21, 28, 12, 18, 15, 14, 24, 16, 23, 9, 22, 7
  • Nondiseased: 4, 6, 10, 25, 8, 11, 1, 2, 3, 26, 19, 5

Hypotheses:

  • H₀: There is no difference in the amount of the urinary component.
  • H₁: There is a difference.

The sum of ranks for the smaller sample is 120. Using a large sample approximation, we obtain a test statistic of -2.507, which gives a p-value of 0.0122. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.
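These figures can be verified with a short script (a sketch assuming NumPy and SciPy are available), applying the large-sample formula given earlier:

```python
# Large-sample rank sum calculation for Example 6.3.
import numpy as np
from scipy.stats import norm, rankdata

diseased = np.array([263, 288, 432, 890, 450, 1270, 220, 350,
                     283, 274, 580, 285, 524, 135, 500, 120])
nondiseased = np.array([60, 119, 153, 588, 124, 196, 14, 23, 43, 854, 400, 73])

ranks = rankdata(np.concatenate([nondiseased, diseased]))
w = ranks[:len(nondiseased)].sum()            # rank sum of smaller sample: 120

n1, n2 = len(nondiseased), len(diseased)      # 12 and 16
mean = n1 * (n1 + n2 + 1) / 2                 # 174
sd = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (w - mean) / sd                           # about -2.507
p = 2 * norm.sf(abs(z))                       # about 0.0122
print(w, round(z, 3), round(p, 4))
```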

Using the two-sample t-test on this set of data yields a p-value of 0.0632, so the null hypothesis cannot be rejected at the 5% level. The discrepancy in conclusions arises because the data for both samples are positively skewed. In cases where the assumptions of the t-test are violated, the Wilcoxon test is more powerful.

McNemar Test

The McNemar test is a statistical method used when dealing with two dependent or related samples, particularly when the responses are categorical or nominal in nature, such as yes/no, success/failure, or present/absent. It’s designed to assess whether there is a significant change in proportions or frequencies between paired observations within the same sample.

Here’s a comprehensive description of the McNemar test:

Purpose:

  • The McNemar test helps determine whether there’s a significant difference in proportions or frequencies between paired observations within the same sample.

 

Assumptions:

  • The test assumes that the paired observations are dependent or related.
  • It’s suitable for nominal or categorical data with two levels.
  • It’s important that the samples being compared are matched pairs, such as before-and-after measurements or data from a repeated-measures design.

 

Hypotheses:

  • The null hypothesis (H₀) states that there is no difference in proportions or frequencies between the paired observations.
  • The alternative hypothesis (H₁) suggests that there is a significant difference between the paired observations.

 

Test Procedure:

  • To conduct the McNemar test, you first create a 2×2 contingency table, where each cell represents a combination of responses for the two paired samples.

  • The table is structured as follows:

    Before \ After   Yes   No
    Yes              a     b
    No               c     d

    The McNemar test statistic is calculated as:

    χ² = (b − c)² / (b + c)

  • This test statistic follows a chi-square distribution with 1 degree of freedom under the null hypothesis. (A computational sketch follows this list.)
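A minimal sketch of the calculation in Python (the discordant counts b and c are hypothetical, assuming SciPy is available):

```python
# McNemar test from the two discordant cell counts; b and c are hypothetical.
from scipy.stats import chi2

b, c = 15, 5                       # before-yes/after-no and before-no/after-yes
stat = (b - c) ** 2 / (b + c)      # chi-squared statistic: 5.0
p = chi2.sf(stat, df=1)            # about 0.025
print(stat, p)
```

Note that only the discordant cells b and c enter the statistic; the concordant cells a and d play no role.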

 

 Interpretation:

  • If the calculated chi-square value is significant at a chosen level of significance (usually α = 0.05), then we reject the null hypothesis.
  • A significant result indicates that there is a significant difference in proportions or frequencies between the paired observations.

 

Applications:

  • Common applications include analyzing the effectiveness of interventions or treatments in before-and-after studies, assessing the impact of an educational program, or evaluating the performance of classifiers in machine learning.

 

Limitations:

  • The McNemar test is specifically designed for paired data and is not suitable for independent samples.
  • It assumes that the paired observations are independent of other pairs.

The McNemar test is a valuable tool for comparing proportions or frequencies between paired observations, particularly when dealing with categorical data on a nominal scale.

Example 6.4:

Just before a national election, several polls are typically conducted to estimate the proportion of votes each party is expected to receive. Suppose a group of individuals is interviewed a fortnight before the election, and the same group is interviewed again three days before the election. We can then test whether there has been a shift in support towards one of the parties. Consider the following data:

                  Survey 2
Survey 1      Party 1   Party 2   Total
Party 1       63        20        83
Party 2       30        5         35
Total         93        25        118

We can represent this table as having the entries:

                  Survey 2
Survey 1      Party 1   Party 2   Total
Party 1       a         b         a + b
Party 2       c         d         c + d
Total         a + c     b + d     118

While this table resembles the 2×2 contingency table used in the proportion section, there are some distinctions. We are not interested in determining whether there is an association between people’s voting preferences at time 1 and time 2. Instead, our focus is on testing whether the number of individuals switching from Party 1 to Party 2 is the same as the number switching from Party 2 to Party 1. The individuals who do not change their preference are irrelevant to this analysis, as they provide no information about the individuals who do change. Our objective is simply to determine whether the proportion in the cell labeled b is equal to the proportion in the cell labeled c. When b + c is small, the McNemar test statistic can be compared to exact tables (as suggested by Conover) to determine statistical significance.

However, when b + c is large, as in this case (b + c = 20 + 30 = 50), an approximation using a chi-squared distribution with 1 degree of freedom may be employed.

Here χ² = (20 − 30)²/(20 + 30) = 2, and the p-value from the chi-squared distribution with 1 degree of freedom is 0.1573. Testing at the 5% level of significance, we observe that the obtained p-value exceeds the threshold. Consequently, we do not have sufficient evidence to reject the null hypothesis, indicating that there is no significant shift in support.

Practical Example: 

Suppose a company implements a new software system to streamline workflow processes. Before the implementation, employees are asked whether they encounter difficulties completing their tasks within a specific time frame. After the software is implemented, the same set of employees are surveyed again to determine if the new system has improved task completion efficiency. Each employee is classified as to whether they encountered difficulties both before and after the software implementation, encountered difficulties only before, encountered difficulties only after, or encountered no difficulties at all. In this scenario, the McNemar test can be used to assess whether the proportion of employees encountering difficulties has significantly decreased after the implementation of the new software system.

Kolmogorov-Smirnov Tests

The Kolmogorov-Smirnov test is a statistical method used to compare two probability distributions and determine if they are significantly different from each other. Unlike other tests that focus on specific parameters like mean or variance, the Kolmogorov-Smirnov test examines the overall shape and characteristics of the distributions.

Here’s how the test works:

Principle: The test is based on the idea that if two populations have identical distributions, their cumulative distribution functions (CDFs) should be very similar. The CDF represents the proportion of observations that are less than or equal to a certain value. By comparing the CDFs of the two samples, we can assess the similarity of the underlying distributions.

 

Procedure:

    • Calculate the empirical cumulative distribution functions (ECDFs) for both samples.
    • Plot the ECDFs on the same graph.
    • Compute the maximum vertical difference between the two ECDFs. This is known as the Kolmogorov-Smirnov statistic (D).
    • Compare the observed D value to critical values from the Kolmogorov-Smirnov distribution to determine statistical significance.

       

Interpretation:

    • If the observed D value is greater than the critical value at a chosen significance level (e.g., 0.05), we reject the null hypothesis of identical distributions.
    • If the observed D value is less than the critical value, we fail to reject the null hypothesis, indicating that there is no significant difference between the distributions.

Assumptions:

    • The samples are independent.
    • The data are continuous or are treated as continuous.
    • The distributions being compared are fully specified (i.e., no parameters are estimated from the data).

Example Application:

    • Suppose we have two sets of data representing the heights of students from two different schools. We want to determine if there is a significant difference in the height distributions between the two schools. By conducting a Kolmogorov-Smirnov test, we can assess whether the observed differences in the ECDFs are statistically significant, indicating differences in the underlying height distributions.

 

The Kolmogorov-Smirnov test provides a way to compare entire distributions rather than specific parameters, making it a valuable tool for assessing the overall similarity or difference between two datasets.

Example 5.6:

Let’s say we have two groups of data, one with 15 measurements and the other with 9. We want to see if these groups come from the same source. Here are the values we have:

Group 1 (9 values): 7.6, 8.4, 8.6, 8.7, 9.3, 9.9, 10.1, 10.6, 11.2
Group 2 (15 values): 5.2, 5.7, 5.9, 6.5, 6.8, 8.2, 9.1, 9.8, 10.8, 11.3, 11.5, 12.3, 12.5, 13.4, 14.6

To understand the distribution of Group 1, let’s look at the proportions of values below certain points. For example, none of the values are less than 7.6, one value out of 9 is less than 8.4, two values are less than 8.6, and so on.

To find the maximum difference between the cumulative distribution functions of the two samples, we first rank all the observations combined from both samples. Then, we calculate the proportion of observations below each rank for each sample. The differences between these proportions are then computed, disregarding the sign (i.e., taking the absolute values). The test statistic is the largest absolute difference.

For example, let’s look at the calculations for the given data:

 

  • For Group 2 (the sample of 15):

    • 5.2 has a cumulative proportion of 1/15.
    • 5.7 has a cumulative proportion of 2/15.
    • And so on, up to 14.6 with a cumulative proportion of 15/15.
  • For Group 1 (the sample of 9):

    • 7.6 has a cumulative proportion of 1/9.
    • 8.4 has a cumulative proportion of 2/9.
    • And so on, up to 11.2 with a cumulative proportion of 9/9.

 

Then, we calculate the absolute differences between the cumulative proportions at each observation. The maximum absolute difference, which occurs at 11.2 (where the proportions are 9/9 = 45/45 and 9/15 = 27/45), is 18/45 or 0.40.

By comparing this value to the cutoff value from tables (for a 95% confidence level), which is 8/15, we determine whether the hypothesis of no difference can be rejected. Since 0.40 is less than 8/15 ≈ 0.53, the hypothesis cannot be rejected with the given sample sizes.
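The hand calculation can be confirmed with scipy.stats.ks_2samp (a sketch assuming SciPy is available):

```python
# Two-sample Kolmogorov-Smirnov test on the data of this example.
from scipy.stats import ks_2samp

group1 = [7.6, 8.4, 8.6, 8.7, 9.3, 9.9, 10.1, 10.6, 11.2]
group2 = [5.2, 5.7, 5.9, 6.5, 6.8, 8.2, 9.1, 9.8, 10.8,
          11.3, 11.5, 12.3, 12.5, 13.4, 14.6]

res = ks_2samp(group1, group2)
print("D =", res.statistic)   # 0.4, i.e. 18/45, matching the hand calculation
print("p =", res.pvalue)      # well above 0.05, so H0 is not rejected
```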

Permutation Tests

Permutation tests, sometimes referred to as re-randomization tests, provide a powerful and flexible approach to hypothesis testing, especially when traditional parametric assumptions cannot be met or when sample sizes are small. The core idea behind permutation tests is to assess whether observed differences between two or more groups could have occurred by chance alone, assuming there is no true difference between the populations they represent.

Here’s a comprehensive overview of how permutation tests work and their key features:

Rationale: The fundamental concept underlying permutation tests is straightforward: if there is truly no difference between the groups being compared, then any specific assignment of observations to different groups is equally plausible. In other words, the observed differences between groups could have arisen randomly.

Procedure:

    • Data Preparation: Begin with the observed data, typically consisting of measurements or observations from two or more groups.
    • Null Hypothesis: Formulate the null hypothesis, which states that there is no difference between the groups.
    • Test Statistic: Choose an appropriate test statistic that quantifies the difference between groups. This could be the mean, median, sum, or any other relevant measure.
    • Permutation: Randomly shuffle or reassign the observations to different groups, maintaining the original sample sizes for each group. This creates new datasets under the assumption of the null hypothesis.
    • Compute Test Statistic: Calculate the test statistic for each permuted dataset.
    • Comparison: Compare the test statistic from the original data to the distribution of test statistics obtained from the permuted datasets.
    • P-value: Calculate the p-value as the proportion of permuted datasets that produce a test statistic as extreme as or more extreme than the one observed in the original data.
    • Inference: Draw conclusions based on the p-value and the chosen significance level.

Advantages:

    • Distribution-free: Permutation tests do not rely on specific assumptions about the distribution of the data, making them robust and applicable to a wide range of scenarios.
    • Flexible: They can be applied to various types of data and experimental designs, including paired, independent, and factorial designs.
    • Accurate: Permutation tests provide exact, non-parametric p-values, avoiding reliance on asymptotic approximations.

Example: Suppose we want to compare the effectiveness of two teaching methods (A and B) in improving students’ test scores. We collect test scores from two groups of students: one taught using method A and the other using method B. To conduct a permutation test (a sketch in code follows these steps), we would:

    • Define the null hypothesis as “There is no difference in test scores between students taught with method A and those taught with method B.”
    • Choose a test statistic, such as the difference in mean test scores between the two groups.
    • Randomly permute the test scores between the two groups, compute the test statistic for each permutation, and build a distribution of test statistics.
    • Compare the observed test statistic to the distribution of permuted test statistics and calculate the p-value.
    • Draw conclusions based on the p-value, indicating whether there is sufficient evidence to reject the null hypothesis.
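A minimal sketch of this procedure in Python (hypothetical scores, assuming NumPy is available):

```python
# Permutation test sketch for the teaching-methods example; scores are hypothetical.
import numpy as np

rng = np.random.default_rng(seed=1)

method_a = np.array([72, 85, 78, 90, 66, 81])   # hypothetical test scores
method_b = np.array([70, 74, 68, 77, 72, 69])

observed = method_a.mean() - method_b.mean()    # observed difference in means

pooled = np.concatenate([method_a, method_b])
n_a, n_perm = len(method_a), 10_000

extreme = 0
for _ in range(n_perm):
    rng.shuffle(pooled)                          # re-randomize group labels
    diff = pooled[:n_a].mean() - pooled[n_a:].mean()
    if abs(diff) >= abs(observed):               # two-sided comparison
        extreme += 1

print("observed diff =", observed, "p =", extreme / n_perm)
```

With only 12 observations, one could also enumerate all possible group assignments exactly rather than sampling them; the random-shuffle version shown here scales to larger samples.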

Considerations:

    • Computational Intensity: Permutation tests can be computationally intensive, especially for large datasets or when conducting a large number of permutations.
    • Sample Size: While permutation tests can be applied to small sample sizes, their power may be limited compared to parametric tests with larger samples.
    • Interpretation: Interpretation of results requires careful consideration of the experimental design, test statistic chosen, and assumptions made.


Permutation tests offer a versatile and reliable approach to hypothesis testing, particularly in situations where parametric assumptions are not met or when traditional tests are not applicable. Their flexibility and robustness make them valuable tools in statistical analysis and hypothesis testing across various fields of research.

Tests for Several Independent Samples

In statistical analysis, the need often arises to compare multiple groups to determine if there are any significant differences among them. This scenario occurs when a variable is measured across several distinct groups, such as different treatment options, product preferences, or performance ratings across various categories. To address this, we employ tests specifically designed for comparing means or distributions across multiple independent samples.

Kruskal-Wallis Test

The Kruskal-Wallis test is a non-parametric method used to determine whether there are statistically significant differences between the means of three or more independent groups. It’s an extension of the Mann-Whitney U test, which is used for comparing two independent samples. The Kruskal-Wallis test assesses whether the distributions of the groups’ observations differ significantly without making assumptions about the underlying distribution of the data.

How it Works:

  1. Ranking: The first step involves ranking all the observations from all groups combined. Ties are handled by assigning the average rank to tied values.
  2. Calculation of Average Ranks: Next, the average rank for each group is calculated. This is done by averaging the ranks of all observations within each group.
  3. Test Statistic: The test statistic, denoted as KW, is computed based on the average ranks of the groups and their sample sizes. It measures the degree of difference between the groups’ distributions.
    • For samples without tied ranks, KW = [12 / (N(N + 1))] Σ (Rᵢ²/nᵢ) − 3(N + 1), where Rᵢ is the sum of ranks in group i, nᵢ is that group’s sample size, and N is the total number of observations.
    • If tied ranks are present, the formula is adjusted accordingly.

Comparison with Critical Values: The test statistic is compared to critical values from appropriate tables or calculated using the chi-squared distribution. For small sample sizes, tables specific to the Kruskal-Wallis test are used, while for larger samples, the test statistic may be compared to chi-squared distribution tables with degrees of freedom equal to (k − 1), where k represents the number of groups being compared. (A computational sketch follows.)
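A minimal sketch in Python (hypothetical data, assuming SciPy is available); scipy.stats.kruskal computes the statistic, including the tie adjustment, and the chi-squared p-value:

```python
# Kruskal-Wallis test on three hypothetical independent groups.
from scipy.stats import kruskal

group1 = [83, 91, 94, 89, 90]
group2 = [101, 100, 91, 93, 96]
group3 = [78, 82, 81, 85, 80]

stat, p = kruskal(group1, group2, group3)
print("KW =", stat, "p =", p)   # p from chi-squared with k - 1 = 2 df
```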

 

Interpretation:

  • If the calculated test statistic exceeds the critical value at a chosen significance level (usually 0.05), then the null hypothesis of no significant difference between group means is rejected.
  • Conversely, if the test statistic does not exceed the critical value, there is insufficient evidence to reject the null hypothesis, indicating no significant difference between group means.

Advantages:

  • Suitable for data that violate the assumptions of parametric tests (e.g., normality, homogeneity of variance).
  • Robust to outliers and non-normal distributions.
  • Allows for comparisons among multiple groups simultaneously.

Applications:

  • Commonly used in various fields, including medicine, social sciences, and business, where researchers need to compare means across multiple groups without making strong distributional assumptions.

The Kruskal-Wallis test is a valuable tool for assessing differences among three or more independent groups when parametric assumptions are not met or when dealing with ordinal or non-normally distributed data.

 

Note:

While the test is often referred to in the context of comparing means, it’s important to note that the Kruskal-Wallis test is a non-parametric test, meaning it does not make assumptions about the underlying distribution of the data.

Since the Kruskal-Wallis test is non-parametric, it does not directly compare means; instead, it compares the distributions of the ranked data across the groups. Therefore, when we refer to differences between groups, it’s more accurate to say that the test is assessing differences in the central tendencies of the groups, which could include means or medians, depending on the nature of the data.

In situations where the data are normally distributed and the groups have similar variances, comparing means using parametric tests like ANOVA (Analysis of Variance) may be appropriate. However, when these assumptions are violated, or when dealing with ordinal or non-normally distributed data, the Kruskal-Wallis test is a robust alternative that compares the overall distributions of the groups, including their central tendencies (which could be means or medians).

Therefore, while the Kruskal-Wallis test is often used to compare medians, it can also be applied to compare means or any other measure of central tendency, depending on the nature of the data and the assumptions being made.

Example 6.7: Comparing Yields of Different Types of Maize

In this example, we’re examining the yields of different types of maize, which can be generalized to various scenarios like financial yields in different sectors, sales of products across regions, or even cell counts in patients with different treatments.

Data:

We have yields for four types of maize, each with its corresponding rank:

Type of Maize   Yield   Rank
1               83      4
1               91      9
1               92      10
1               90      8
2               94      11
2               88      6
2               96      12
2               89      7
2               84      5
3               101     14
3               100     13
3               81      2
3               82      3
4               78      1

Analysis:

The Kruskal-Wallis test is employed since normality assumptions are questionable, especially with small sample sizes. The test statistic is computed to be 9.45, with a resulting p-value of 0.0239, indicating a significant difference in yields among the maize types.

Next, pairwise comparisons are conducted to identify specific differences between maize types. The critical value is calculated based on the test statistic, and differences between average ranks are assessed. Results indicate significant differences between all maize types except types 1 and 2.

Conclusion:

Maize types 1 and 2 do not significantly differ from each other, while types 3 and 4 differ significantly from all other types. A graphical representation can be used to visualize these differences effectively.

 

Testing Several Related Samples

In various research scenarios, measurements on different groups are taken across multiple areas or conditions. For instance:

  • Different types of maize grown on multiple farms.
  • Ratings provided to various brands of ice-cream by multiple tasters.
  • Various methods of measuring something applied to samples from individuals, animals, plants, or units.

Pairing Consideration:

In such cases, it’s essential to account for the pairing of measurements from the same individual, plant, animal, or unit. This generalizes the paired sample t-test.

Non-parametric Tests:

Two common non-parametric tests used in this scenario when the response variable is on a continuous scale are the Quade test and Friedman test.

  • Quade Test: An extension of the Wilcoxon signed ranks test, often more powerful for fewer than 5 groups.
  • Friedman Test: An extension of the sign test, generally more powerful for more than 4 groups.

Procedure:

  1. Obtain a single reading for each treatment within each block.
  2. Rank the values within each block.
  3. Calculate the test statistic.

Application:

For example, when multiple raters provide ratings for different ice-cream brands, each rater’s ratings are ranked independently. (A computational sketch of the Friedman test follows.)
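A minimal sketch in Python (hypothetical ratings, assuming SciPy is available); each block is one rater, and each sample passed to the function holds one treatment’s readings across all blocks:

```python
# Friedman test sketch: hypothetical ratings of three brands by five raters.
from scipy.stats import friedmanchisquare

brand_a = [7, 8, 6, 9, 7]   # one rating per rater (block)
brand_b = [5, 6, 5, 7, 6]
brand_c = [8, 9, 7, 9, 8]

stat, p = friedmanchisquare(brand_a, brand_b, brand_c)
print("Friedman statistic =", stat, "p =", p)
```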

Multiple Comparison:

If the tests are significant, multiple comparison tests are conducted to determine specific differences between treatments. Conover provides details on these tests, although they are not readily available in some statistical software packages.

Nominal Scale Data:

For data on a nominal scale (e.g., yes/no), the Cochran Q test is appropriate.
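Since the Cochran Q statistic takes only a few lines to compute, here is a minimal sketch in Python (hypothetical yes/no data coded 1/0, assuming NumPy and SciPy are available); the statistic is referred to the chi-squared distribution with k − 1 degrees of freedom:

```python
# Cochran Q sketch: rows are subjects, columns are related conditions.
import numpy as np
from scipy.stats import chi2

x = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [1, 1, 0],
              [0, 0, 0]])            # hypothetical binary responses

k = x.shape[1]                       # number of conditions
col = x.sum(axis=0)                  # successes per condition
row = x.sum(axis=1)                  # successes per subject

q = (k - 1) * (k * (col**2).sum() - col.sum()**2) / (k * row.sum() - (row**2).sum())
p = chi2.sf(q, df=k - 1)             # Q is approximately chi-squared with k-1 df
print("Q =", q, "p =", p)
```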

Lesson Summary

In this section on non-parametric tests, we explored statistical methods suitable for situations where traditional parametric assumptions might not hold. We began by discussing the Mann-Whitney U test, a powerful tool for comparing two independent groups when the data are ordinal or continuous but not normally distributed. Next, we delved into the Wilcoxon signed-rank test, designed for paired samples to assess differences between related groups. This test is particularly useful when analyzing pre-test and post-test measurements or before-and-after intervention data. We also examined the Kruskal-Wallis test, an extension of the Mann-Whitney U test, allowing comparison of multiple independent groups. This method is valuable when exploring differences among several treatments or conditions. Additionally, we discussed the Friedman test, a non-parametric alternative to repeated measures ANOVA, suitable for analyzing data with repeated measures across multiple conditions or time points. Finally, we touched upon the Cochran Q test, designed for nominal data to assess differences in proportions across multiple related groups. Overall, these non-parametric tests offer robust alternatives to traditional parametric methods, providing valuable insights into various research scenarios while accommodating diverse data distributions and study designs.

References

Brown, B.W. & Hollander, M. (1977). Statistics: A Biomedical Introduction. New York: Wiley.

Conover, W.J. (1971). Practical Nonparametric Statistics. New York: Wiley.
