ABSTRACT

Objective:

Cross tables were used frequently in the early biostatistics literature, and they remain important in many clinical examinations today. The aim of this study is to investigate how 2x2 tables are treated in the probability literature and how these results are applied in practice, so that different methods can be developed for applications.

Methods:

The distribution of a 2x2 table is determined by treating one cell of the table as a random variable and calculating the probability that this variable takes the observed value. The hypergeometric distribution was used in the study. The details are explained in the Methodology section.

Results:

Statistics obtained from 2x2 tables, such as the odds ratio, are numerical values that guide the researcher in experimental studies. Given the distribution of the table, the probabilities of these values are an important finding for the experimental study. In particular, a high probability value indicates how well a statistic commonly used in biostatistics, such as the odds ratio, represents the experimental study performed.

Conclusion:

The study yields two main results: the determination of the maximum probability value representing the experimental study, and the weighted odds ratios used to combine odds ratios in meta-analysis.

Keywords:

Contingency table, probability distribution, hypergeometric distribution, odds ratio, relative risk, meta-analysis

Introduction

One of the problems encountered in scientific research is the inadequacy of data. This can be due to the rarity of the data, to limits on time and cost, or to the lack of specialized personnel. For this reason, especially in health research, clinical trials and studies are undertaken on a limited number of units. Sometimes it is necessary to work with small samples for ethical reasons. In such cases, combining studies with similar characteristics by different researchers may make the study findings more meaningful. It is therefore necessary to develop suitable combination methods.

The most striking example of this is the combination of odds ratios (ψ). The methods for combining odds ratios in the literature include the Mantel-Haenszel, Peto, General Variance, and DerSimonian-Laird methods. Detailed information on these methods can be found in Katz et al. (1) and Morris and Gardner (2). These studies give important information about constructing confidence intervals for the odds ratio, and they use the normal distribution to do so. However, the normality assumption may not always hold. In that case, it is important to determine the distribution of the odds ratio. No study in the literature has reported the distribution of the odds ratio. However, a distribution that can be used in contingency tables has been examined by Patnaik (3) and Stevens (4). The work of these researchers is presented with examples in the following sections. These examples were very useful in calculating the distribution of the odds ratio, which is shown in the example in the Results section of our study. In addition, the distribution of combined odds ratios is calculated in an application to real data.

Some Probabilistic Notes on Contingency Tables

In biostatistics, the statistical methods which are frequently used in both retrospective and prospective studies are based on statistics such as relative risk and odds ratio obtained from the information in Table 1. Therefore, it is very important to examine the probabilistic features of this table.

Because a retrospective study is limited to observed data, the experimental values are fixed. However, this does not mean that they cannot vary with the retrospective follow-up period or for other reasons. Thus, the value a in the table is the observed value of a random variable X₁. The same applies to the control group. The probability Pr{X₁ = a} can be calculated as the ratio of the number of desired states to the number of all possible states, as in the hypergeometric probability. The number of possible states is written as follows,

$\frac{N!}{a! \, b! \, c! \, d!}$

The number of desired states can be calculated as follows,

$\binom{m}{a} \binom{n}{d} \binom{r}{c} \binom{s}{b} = \frac{m! \, n! \, r! \, s!}{a! (m - a)! \, d! (n - d)! \, c! (r - c)! \, b! (s - b)!}$

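Dividing the number of desired states by the number of possible states, and using m − a = b, n − d = c, r − c = a and s − b = d, we have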

$\Pr\{X_{1} = a\} = \frac{m! \, n! \, r! \, s!}{N! \, a! \, b! \, c! \, d!}$

where max{0, m + r − N} ≤ X ₁ ≤ min{m, r}.
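The probability above can be checked numerically. The following Python sketch is illustrative only: the function names and the example table (a = 3, b = 2, c = 1, d = 4) are hypothetical and do not come from the study. It assumes m = a + b and n = c + d are the row totals and r = a + c and s = b + d are the column totals.

```python
from math import comb, factorial


def table_probability(a, b, c, d):
    """Pr{X1 = a} for the 2x2 table [[a, b], [c, d]] with all margins fixed."""
    m, n = a + b, c + d  # row totals (assumed: case and control groups)
    r, s = a + c, b + d  # column totals
    N = m + n
    numerator = factorial(m) * factorial(n) * factorial(r) * factorial(s)
    denominator = (factorial(N) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


def x1_distribution(m, n, r):
    """Probability of every admissible value of X1 for fixed margins m, n, r."""
    N = m + n
    lowest, highest = max(0, m + r - N), min(m, r)
    return {x: comb(m, x) * comb(n, r - x) / comb(N, r)
            for x in range(lowest, highest + 1)}


if __name__ == "__main__":
    # Hypothetical table, used only to exercise the functions.
    print(table_probability(3, 2, 1, 4))
    for x, p in x1_distribution(5, 5, 4).items():
        print(x, round(p, 4))
```

The factorial form and the binomial-coefficient form give the same value for the observed cell, and the probabilities over the admissible range of X₁ sum to one.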

Sample 1. Consider the data in Table 2

In this example, the variable X₁ takes values in the range 0 ≤ X₁ ≤ 10. The possible states and probabilities of X₁ are shown in Table 3.

According to Table 3, the value X₁ = 5 has the highest probability. The graph of the probability values in Table 3 is as follows,

Methodology

In the literature, the first study of this probability is due to P. B. Patnaik in 1948. In that study, the cell common to the case group and the positive effect was treated as a random variable, and Patnaik showed that it has a hypergeometric distribution. This makes it easier to obtain the term representing the odds ratio from the conditional probability of the hypergeometric distribution, which is why the hypergeometric distribution was used in this study. Using the hypergeometric distribution, Patnaik calculated the mean and variance as E X₁ = mr / N and Var X₁ = mnrs / (N²(N − 1)) (3). The mean E X₁ is used as the expected cell value in the chi-square test of association. This work was followed by W. L. Stevens, who expressed the conditional probability of X₁ as a function of a, given that all marginal totals are known, as follows (4),

$\Pr\{X_{1} = a \mid X_{1} + X_{2} = r, m\} = C \binom{m}{a} \binom{n}{r - a} \psi^{a}$

where ψ = ad / bc is the odds ratio. The conditional probability above can be obtained as the product of two binomial probabilities,

Pr{X ₁ = a | X ₁ + X ₂ = r, m} = Pr{X ₁ = a}Pr{X ₂ = c}

$= \binom{m}{a} p_{1}^{a} q_{1}^{m - a} \binom{n}{c} p_{2}^{c} q_{2}^{n - c} = \frac{q_{1}^{m} q_{2}^{n} p_{2}^{r}}{q_{2}^{r}} \binom{m}{a} \binom{n}{r - a} \psi^{a}$

where p₁ and p₂ are the probabilities of success in the case and control groups, respectively, with q₁ = 1 − p₁ and q₂ = 1 − p₂. In this expression ψ corresponds to p₁q₂ / (p₂q₁), whose sample estimate is ad / bc. In addition, the ratio p₁ / p₂ is called the relative risk. As a result, this equation shows that the conditional probability can be written as a function of a, which is an important result for 2x2 tables. If the observed value of X₁ is smaller than another admissible value, the corresponding ψ is smaller than the odds ratio for that value, and vice versa. Using this property, Jerome Cornfield constructed a confidence interval with probability 1 − α for the odds ratio in his 1956 study (5). Cornfield obtained the lower limit ψ₁ for ψ from the solution of the following equation,

$\sum_{y = X_{1}}^{m} C \binom{m}{y} \binom{n}{r - y} \psi^{y} = \frac{\alpha}{2}$

Similarly, he obtained the upper limit ψ ₂ for ψ from the solution of the following equation,

$\sum_{y = 0}^{X_{1}} C \binom{m}{y} \binom{n}{r - y} \psi^{y} = \frac{\alpha}{2}$

Thus, the confidence interval can be written as Pr{ψ ₁ ≤ ψ ≤ ψ ₂} = 1 − α.
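The two tail equations can be solved numerically. The Python sketch below is one possible way to do this and is not taken from Cornfield's paper or from the present study: the function names are hypothetical, C is taken to be the constant that makes the conditional probabilities sum to one for the given ψ, and bisection on log ψ is used as the solver.

```python
from math import comb, exp, inf, log


def conditional_pmf(m, n, r, log_psi):
    """Pr{X1 = y | X1 + X2 = r} for a given log odds ratio, margins fixed."""
    lowest, highest = max(0, r - n), min(m, r)
    # Work with logarithms so that large values of psi do not overflow.
    log_w = {y: log(comb(m, y) * comb(n, r - y)) + y * log_psi
             for y in range(lowest, highest + 1)}
    top = max(log_w.values())
    weights = {y: exp(v - top) for y, v in log_w.items()}
    total = sum(weights.values())  # plays the role of 1 / C
    return {y: w / total for y, w in weights.items()}


def _solve_increasing(f, target, low=-30.0, high=30.0, iterations=200):
    """Bisection for f(x) = target, assuming f is increasing on [low, high]."""
    for _ in range(iterations):
        mid = (low + high) / 2
        if f(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def cornfield_exact_limits(a, m, n, r, alpha=0.05):
    """Limits (psi_1, psi_2) at which each tail probability equals alpha / 2."""
    x_min, x_max = max(0, r - n), min(m, r)

    def upper_tail(t):  # Pr{X1 >= a}, increasing in psi
        return sum(p for y, p in conditional_pmf(m, n, r, t).items() if y >= a)

    def lower_tail(t):  # Pr{X1 <= a}, decreasing in psi
        return sum(p for y, p in conditional_pmf(m, n, r, t).items() if y <= a)

    psi_1 = 0.0 if a == x_min else exp(_solve_increasing(upper_tail, alpha / 2))
    psi_2 = inf if a == x_max else exp(
        _solve_increasing(lambda t: -lower_tail(t), -alpha / 2))
    return psi_1, psi_2


if __name__ == "__main__":
    # Hypothetical table a = 3, b = 2, c = 1, d = 4 (m = 5, n = 5, r = 4).
    print(cornfield_exact_limits(3, 5, 5, 4, alpha=0.05))
```

When the observed value lies at the boundary of its admissible range, one tail equation has no solution, so the sketch returns 0 or infinity for that limit.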

Results and Discussion

Here, the conditional probability is obtained as the product of two binomial distributions for the independent variables X₁ and X₂. Normal-theory test procedures can then be used in hypothesis tests, since the limiting distributions of X₁ and X₂ approach the normal distribution. However, this holds only if the marginal totals are large enough; otherwise, it may lead to incorrect interpretations. When an exact test statistic for ψ is desired, it is more accurate to obtain the exact distribution and to test with a nonparametric method. For the mean and variance of ψ to be finite, it is sufficient that the cells satisfy a < m and c > 0, so the distribution of ψ must be obtained conditionally on these conditions. For this reason, many researchers use the normal approximation instead. The conditional distribution can be obtained by dividing the binomial probabilities by Pr{X₁ < m} for X₁ and by Pr{X₂ > 0} for X₂. In the following example, we show the possible values and probabilities of ψ.
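The truncation-and-product construction described above can be sketched in Python as follows. This is an illustrative sketch only, not the program used in the study: the function names are hypothetical, and the success probabilities p1 and p2 are taken as inputs (for instance estimated as a / m and c / n, which is an assumption, since the text does not state how they are chosen).

```python
from collections import defaultdict
from fractions import Fraction
from math import comb


def truncated_binomial(size, p, allowed):
    """Binomial(size, p) probabilities renormalized over the allowed values."""
    raw = {x: comb(size, x) * p ** x * (1 - p) ** (size - x) for x in allowed}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}


def psi_distribution(m, n, p1, p2):
    """Distribution of psi = ad / (bc), conditional on X1 < m and X2 > 0."""
    px1 = truncated_binomial(m, p1, range(0, m))      # X1 = a, with a < m
    px2 = truncated_binomial(n, p2, range(1, n + 1))  # X2 = c, with c > 0
    dist = defaultdict(float)
    for x1, prob1 in px1.items():
        for x2, prob2 in px2.items():
            # b = m - x1 and d = n - x2; exact fractions merge equal psi values.
            psi = Fraction(x1 * (n - x2), (m - x1) * x2)
            dist[psi] += prob1 * prob2
    return dict(dist)


def mean_psi(dist):
    """Expected value of psi under the distribution."""
    return sum(float(u) * p for u, p in dist.items())


if __name__ == "__main__":
    # Hypothetical margins and probabilities, for illustration only.
    dist = psi_distribution(4, 4, 0.5, 0.25)
    for u in sorted(dist):
        print(u, round(dist[u], 4))
    print("E(psi) =", round(mean_psi(dist), 4))
```

Excluding a = m and c = 0 keeps the denominator bc strictly positive, so every value of ψ in the resulting distribution is finite.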

Sample 2. Let the sample data be as follows,

The multiplication probability table and the probability table of ψ can be formed with the data in Table 4 using the conditional probabilities of variables X ₁ and X ₂,

The graph of multiplication probabilities in Table 5 is as follows,

When the probabilities in Table 6 are taken into consideration, ψ is seen to have its highest probability at ψ = 4. The odds ratio for the experimental data in Table 4 takes a value between 2 and 5 with high probability (Pr{2 ≤ ψ ≤ 5} = 0.443). Such probabilistic information can also be supported in statistical terms by constructing rejection and acceptance regions from the distribution at significance level α. Moreover, the mean Eψ = 6.7133 obtained from the distribution is an important statistic for ψ. The graph of the probabilities in Table 6 is as follows.

Table 6 shows the distribution of ψ. The distribution of ψ can be obtained easily when multiple tables are available for the same X₁. Assume that there are k tables for X₁. In this case, the probabilities for each table are written as follows,

Pr{ψ _j = u} = p_j(u), j = 1, ⋯ , k.

The distribution of all tables will be as follows,

$\Pr\{\psi = u\} = \frac{1}{k} \sum_{j = 1}^{k} p_{j}(u).$

Since the mean Eψ is derived from k samples drawn from the population, it can represent the population well. Finally, the following example of a combined odds ratio is presented.

The following example table was taken from Afshari et al. (6). That meta-analysis investigates the effect of opium and smoking on bladder cancer. Table 7 was created by considering only opium use. The distribution and expected value of the odds ratio were obtained for each study. The expected value of the combined odds ratio is given at the end of the table. The MATLAB program used in the calculation is attached.
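The pooling rule Pr{ψ = u} = (1/k) Σ p_j(u), summed over the k tables, can be sketched in Python as follows. This is an illustrative sketch and not the MATLAB program mentioned above: the function names and the two small distributions are hypothetical and do not reproduce Table 7.

```python
from collections import defaultdict


def pool_distributions(dists):
    """Equal-weight mixture of k distributions given as {value: probability}."""
    k = len(dists)
    pooled = defaultdict(float)
    for dist in dists:
        for u, p in dist.items():
            pooled[u] += p / k
    return dict(pooled)


def expected_value(dist):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(u * p for u, p in dist.items())


if __name__ == "__main__":
    # Two hypothetical per-study distributions of psi.
    study_1 = {0.5: 0.2, 1.0: 0.5, 2.0: 0.3}
    study_2 = {1.0: 0.4, 2.0: 0.4, 4.0: 0.2}
    pooled = pool_distributions([study_1, study_2])
    print(pooled)
    print("combined E(psi) =", expected_value(pooled))
```

Because the pooled distribution is an equal-weight mixture, its mean equals the average of the per-study means Eψ.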

Conclusion

In general, studies in the field of biostatistics form a comprehensive and technically rich literature. This is because many scientific techniques are combined with medical data under the heading of biostatistics. A scientific technique needs not only an opinion but also an interpretation, and this interpretation is usually attributed to the data. However, interpretation is the common point of data and technique, and it increases both the scientific value of the results obtained from the data and the importance of the technique used. Biostatistics studies are therefore important because they bring data and technique together. If the odds ratio obtained from data of the form in Table 1 is smaller than 1, the factor decreases the risk of disease. If the odds ratio is equal to 1, the factor has no effect on the disease. If the odds ratio is larger than 1, the factor increases the risk of the disease. Thus, X₁ is the most important variable in ensuring a high odds ratio. Because the value of X₁ is subject to chance, it is important to know the value at which its probability is maximal. For this, the distribution of X₁ must be obtained and interpreted. In the study, a 2 × 2 data table has been shown to have a hypergeometric distribution when considered unconditionally. Under this distribution, X₁ attains its maximum probability at the integer part of (m + 1)(r + 1) / (N + 2). In addition, this value is the maximum probability value of the odds ratio. This result is very important in terms of both data and theory. If data are not interpreted within a theoretical structure, no conclusion can be obtained. Similarly, obtaining the distribution of ψ is also important for interpretation. When values obtained from different tables are combined in a probability distribution, the distribution of a single variable ψ can be obtained for all tables. This result is also very important for meta-analysis.
The mean Eψ for a single table is therefore important for combined tables. Many methods for combining odds ratios have been presented in the literature; however, no method of this kind has been presented. The reason is that the existing methods are easy for researchers to calculate. However, probabilistic methods are preferable when better results are needed. Finally, one point that should be taken into consideration is that if the numbers of cases and controls in a 2x2 table are sufficient, parametric methods can be used easily. The Mantel-Haenszel, Peto, General Variance, and DerSimonian-Laird methods are examples. If the amount of data is quite low, it is more appropriate to use probabilistic methods.

Ethics

Ethics Committee Approval: Ethics committee approval was not required, since there is no “animal or human element” in our study, and the study was conducted entirely on hypothetical theoretical data.

Peer-review: Externally peer reviewed.

Authorship Contributions

Concept: M.G., M.O.K., Y.G., Design: M.G., M.O.K., Analysis or Interpretation: M.G., M.O.K., Y.G., Literature Search: M.G., Writing: M.G., Y.G.

Conflict of Interest: No conflict of interest was declared by the authors.

Financial Disclosure: The authors declared that this study received no financial support.

References

1. Katz D, Baptista J, Azen SP, Pike MC. Obtaining confidence intervals for the risk ratio in cohort studies. Biometrics 1978;34:469-74.

2. Morris JA, Gardner MJ. Calculating confidence intervals for relative risks (odds ratios) and standardised ratios and rates. Br Med J (Clin Res Ed) 1988;296:1313-6.

3. Patnaik PB. The power function of the test for the difference between two proportions in a 2 × 2 table. Biometrika 1948;35:157-75.

4. Stevens WL. Mean and variance of an entry in a contingency table. Biometrika 1951;38:468-70.

5. Cornfield J. A statistical problem arising from retrospective studies. Third Berkeley Symp. on Math. Statist. and Prob. 4. 1956; p. 135-48.

6. Afshari M, Janbabaei G, Bahrami MA, Moosazadeh M. Opium and bladder cancer: A systematic review and meta-analysis of the odds ratios for opium use and the risk of bladder cancer. PLoS One 2017;12:e0178527.

Analyzing the Odds Ratio Via Distribution Function