Graphical Methods for Describing Data

  • In data analysis, a variable is any characteristic that can vary for the population of individuals or objects being analyzed. For example, both gender and age represent variables among people.
  • Data are collected from a population by observing either a single variable or more than one variable simultaneously. The distribution of a variable, or distribution of data, indicates the values of the variable and how frequently the values are observed in the data.

Frequency Distributions

  • The frequency, or count, of a particular category or numerical value is the number of times that the category or value appears in the data.
  • A frequency distribution is a table or graph that presents the categories or numerical values along with their associated frequencies.
  • The relative frequency of a category or a numerical value is the associated frequency divided by the total number of data. Relative frequencies may be expressed in terms of percents, fractions, or decimals.
  • A relative frequency distribution is a table or graph that presents the relative frequencies of the categories or numerical values.

    Ex.  A survey was taken to find the number of children in each of 25 families. A list of the values collected in the survey follows.

    Here are the resulting frequency and relative frequency distributions of the data.

    Note that the total for the relative frequencies is 100%. If decimals were used instead of percents, the total would be 1. The sum of the relative frequencies in a relative frequency distribution is always 1.
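
As a rough illustration of how these two distributions are built, here is a minimal Python sketch. The list `children` is hypothetical data (25 made-up values, not the survey values from the example), and the function name `frequency_distribution` is an assumption chosen for illustration.

```python
from collections import Counter

def frequency_distribution(values):
    """Return {value: (frequency, relative_frequency)}, ordered by value."""
    counts = Counter(values)
    total = len(values)
    return {v: (c, c / total) for v, c in sorted(counts.items())}

# Hypothetical data: number of children in each of 25 families (made up).
children = [0, 1, 2, 2, 3, 1, 0, 2, 4, 1, 2, 3, 2,
            1, 0, 5, 2, 3, 1, 2, 4, 3, 2, 1, 2]

dist = frequency_distribution(children)
for value, (freq, rel) in dist.items():
    print(f"{value}: frequency = {freq}, relative frequency = {rel:.0%}")

# The relative frequencies sum to 1 (that is, 100%), up to rounding error.
total_relative = sum(rel for _, rel in dist.values())
print(f"total relative frequency = {total_relative:.2f}")
```

Dividing each count by the total number of data is the only step that separates the relative frequency distribution from the frequency distribution.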

Bar Graphs

  • In a bar graph, rectangular bars are used to represent the categories of the data, and the height of each bar is proportional to the corresponding frequency or relative frequency.
  • All of the bars are drawn with the same width, and the bars can be presented either vertically or horizontally. Bar graphs enable comparisons across several categories, making it easy to identify frequently and infrequently occurring categories.

    Ex.

    From the graph, we can conclude that the college with the greatest fall 2009 enrollment was College E, and the college with the least enrollment was College A. Also, we can estimate that the enrollment for College D was about 6,400.

  • A segmented bar graph is used to show how different subgroups or subcategories contribute to an entire group or category. In a segmented bar graph, each bar represents a category that consists of more than one subcategory. Each bar is divided into segments that represent the different subcategories. The height of each segment is proportional to the frequency or relative frequency of the subcategory that the segment represents.

    Ex.

  • Different values can be estimated from the segmented bar graph above. For example, for College D, the total enrollment was approximately  students, the part-time enrollment was approximately , and the full-time enrollment was approximately , or  students.
  • Bar graphs can also be used to compare different groups using the same categories.

    Ex.  

    Observe that for all three colleges, the fall 2009 enrollment was greater than the spring 2010 enrollment. Also, the greatest decrease in the enrollment from fall 2009 to spring 2010 occurred at College B.

Circle Graphs

  • Circle graphs, often called pie charts, are used to represent data with a relatively small number of categories. They illustrate how a whole is separated into parts. The area of the circle graph representing each category is proportional to the part of the whole that the category represents.

    Ex. 

    The graph shows that of all United States production of photographic equipment and supplies in 1971, Sensitized Goods was the category with the greatest dollar value. 

  • Each part of a circle graph is called a sector. Because the area of each sector is proportional to the percent of the whole that the sector represents, the measure of the central angle of a sector is proportional to the percent of 360 degrees that the sector represents. For example, the measure of the central angle of the sector representing the category Prepared Photochemicals is 7 percent of 360 degrees, or 25.2 degrees.
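
The central-angle computation can be sketched in a few lines of Python. The function name `central_angle` is an assumption for illustration; the 7 percent figure is the one quoted above for Prepared Photochemicals.

```python
def central_angle(percent):
    """Central angle, in degrees, of a sector representing `percent` of the whole."""
    return percent / 100 * 360

# Prepared Photochemicals: 7 percent of the whole.
angle = central_angle(7)
print(round(angle, 1))  # 25.2
```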

Histograms

  • When a list of data is large and contains many different values of a numerical variable, it is useful to organize it by grouping the values into intervals, often called classes. To do this, divide the entire interval of values into smaller intervals of equal length and then count the values that fall into each interval. In this way, each interval has a frequency and a relative frequency. The intervals and their frequencies (or relative frequencies) are often displayed in a histogram.
  • Histograms are graphs of frequency distributions that are similar to bar graphs, but they have a number line for the horizontal axis. Also, in a histogram, there are no regular spaces between the bars. Any spaces between bars in a histogram indicate that there are no data in the intervals represented by the spaces.
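
The grouping step behind a histogram can be sketched as follows. This is a minimal illustration, not a standard routine: the data are hypothetical, the function name `class_frequencies` is made up, and the choice to include the left endpoint of each class but not the right is an assumed convention.

```python
def class_frequencies(values, start, width, num_classes):
    """Count the values falling in each class [start + k*width, start + (k+1)*width)."""
    counts = [0] * num_classes
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < num_classes:
            counts[k] += 1
    return counts

# Hypothetical measurements (made up for illustration).
data = [12, 15, 21, 22, 22, 27, 31, 34, 35, 38, 51, 58]
counts = class_frequencies(data, start=10, width=10, num_classes=5)

# A crude text histogram: one '#' per value in the class.
for k, c in enumerate(counts):
    low = 10 + 10 * k
    print(f"{low}-{low + 10}: {'#' * c} ({c}, {c / len(data):.1%})")
```

The class from 40 to 50 comes out empty, which is the kind of gap between bars that the text describes.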

Scatter plots

  • All examples used thus far have involved data resulting from a single characteristic or variable. These types of data are referred to as univariate; that is, data observed for one variable. Sometimes data are collected to study two different variables in the same population of individuals or objects. Such data are called bivariate data.
  • To show the relationship between two numerical variables, the most useful type of graph is a scatter plot. In a scatter plot, the values of one variable appear on the horizontal axis of a rectangular coordinate system and the values of the other variable appear on the vertical axis. For each individual or object in the data, an ordered pair of numbers is collected, one number for each variable, and the pair is represented by a point in the coordinate system.
  • A scatter plot makes it possible to observe an overall pattern, or trend, in the relationship between the two variables. Also, the strength of the trend as well as striking deviations from the trend are evident. In many cases, a line or a curve that best represents the trend is also displayed in the graph and is used to make predictions about the population.

    Ex.   A bicycle trainer studied 50 bicyclists to examine how the finishing time for a certain bicycle race was related to the amount of physical training in the three months before the race. To measure the amount of training, the trainer developed a training index, measured in "units" and based on the intensity of each bicyclist's training. The data and the trend of the data, represented by a line, are displayed in the scatter plot below.

    In addition to the given trend line, you can see how scattered or close the data are to the trend line; or to put it another way, you can see how well the trend line fits the data. You can also see that the finishing times generally decrease as the training indices increase and that three or four data are relatively far from the trend.

    Several types of predictions can be based on the trend line. For example, it can be predicted, based on the trend line, that a bicyclist with a training index of  units would finish the race in approximately  hours. This value is obtained by noting that the vertical line at the training index of  units intersects the trend line very close to  hours.

    Another prediction based on the trend line is the number of minutes that a bicyclist can expect to lower his or her finishing time for each increase of 10 training index units. This prediction is basically the ratio of the change in finishing time to the change in training index, or the slope of the trend line. Note that the slope is negative. To estimate the slope, estimate the coordinates of any two points on the line (for instance, the points at the extreme left and right ends of the line) and divide the difference in finishing times by the difference in training indices. The slope is approximately

    $-0.026$

    which is measured in hours per unit. The slope can be interpreted as follows: the finishing time is predicted to decrease 0.026 hours for every unit by which the training index increases. Since we want to know how much the finishing time decreases for an increase of 10 units, we multiply the rate by 10 to get 0.26 hour per 10 units. To compute the decrease in minutes per 10 units, we multiply 0.26 by 60 to get approximately 16 minutes. Based on the trend line, the bicyclist can expect to decrease the finishing time by 16 minutes for every increase of 10 training index units.
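
The unit conversion in this example can be checked with a short Python sketch. The slope value is the one stated in the text; the variable names are assumptions for illustration.

```python
slope_hours_per_unit = -0.026  # slope of the trend line, as stated in the text
units = 10                     # increase in training index

change_hours = slope_hours_per_unit * units  # hours per 10 units
change_minutes = change_hours * 60           # minutes per 10 units

print(f"{change_hours:.2f} hours, or about {round(change_minutes)} minutes")
```

The negative sign carries through both steps, which matches the reading that the finishing time decreases as the training index increases.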

Time plots

  • Sometimes data are collected in order to observe changes in a variable over time. For example, sales for a department store may be collected monthly or yearly. A time plot (sometimes called a time series) is a graphical display useful for showing changes in data collected at regular intervals of time.
  • A time plot of a variable plots each observation corresponding to the time at which it was measured. A time plot uses a coordinate plane similar to a scatter plot, but the time is always on the horizontal axis, and the variable measured is always on the vertical axis.
  • Additionally, consecutive observations are connected by a line segment to emphasize increases and decreases over time.

    Ex. 

    You can observe from the graph that the greatest increase in fall enrollment between consecutive years occurred from 2008 to 2009. One way to determine this is by noting that the slope of the line segment joining the values for 2008 and 2009 is greater than the slopes of the line segments joining all other consecutive years, because the time intervals are regular.

Numerical Methods for Describing Data - Measures of Central Tendency

  • Measures of central tendency indicate the "center" of the data along the number line and are usually reported as values that represent the data. There are three common measures of central tendency: (i) the arithmetic mean—usually called the average or simply the mean, (ii) the median, and (iii) the mode.
  • To calculate the mean of n numbers, take the sum of the n numbers and divide it by n.

    Ex.  For the five numbers 6, 4, 7, 10, and 4, the mean is

    $\frac{6 + 4 + 7 + 10 + 4}{5} = \frac{31}{5} = 6.2$

    When several values are repeated in a list, it is helpful to think of the mean of the numbers as a weighted mean of only those values in the list that are different.

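    Ex.  Consider the following list of 16 numbers.

    2, 4, 4, 5, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9

    There are only 6 different values in the list: 2, 4, 5, 7, 8, and 9. The mean of the numbers in the list can be computed as

    $\frac{1(2) + 2(4) + 1(5) + 6(7) + 2(8) + 4(9)}{1 + 2 + 1 + 6 + 2 + 4} = \frac{109}{16} = 6.8125$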

    The number of times a value appears in the list, or the frequency, is called the weight of that value. So the mean of the 16 numbers is the weighted mean of the values 2, 4, 5, 7, 8, and 9, where the respective weights are 1, 2, 1, 6, 2, and 4. Note that the sum of the weights is the number of numbers in the list, 16.
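
A minimal Python sketch of this weighted mean follows, using the values and weights stated above. The function name `weighted_mean` is an assumption for illustration. The sketch also expands the weights back into a full list to confirm that the ordinary mean gives the same result.

```python
values = [2, 4, 5, 7, 8, 9]
weights = [1, 2, 1, 6, 2, 4]  # frequency of each value in the list

def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

wm = weighted_mean(values, weights)

# Rebuild the full list of 16 numbers and take the ordinary mean.
full_list = [v for v, w in zip(values, weights) for _ in range(w)]
plain_mean = sum(full_list) / len(full_list)

print(wm, plain_mean, len(full_list))
```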

  • The mean can be affected by just a few values that lie far above or below the rest of the data, because these values contribute directly to the sum of the data and therefore to the mean. By contrast, the median is a measure of central tendency that is fairly unaffected by unusually high or low values relative to the rest of the data.
  • To calculate the median of n numbers, first order the numbers from least to greatest. If n is odd, then the median is the middle number in the ordered list of numbers. If n is even, then there are two middle numbers, and the median is the average of these two numbers.

    Ex.  For the five numbers 6, 4, 7, 10, and 4, the mean is 6.2.

    The five numbers listed in increasing order are 4, 4, 6, 7, 10, so the median is 6, the middle number. Note that if the number 10 in the list is replaced by the number 24, the mean increases from 6.2 to

    $\frac{4 + 4 + 6 + 7 + 24}{5} = \frac{45}{5} = 9$

    but the median remains equal to 6. This example shows how the median is relatively unaffected by an unusually large value.

  • The median, as the "middle value" of an ordered list of numbers, divides the list into roughly two equal parts. However, if the median is equal to one of the data values and it is repeated in the list, then the numbers of data above and below the median may be rather different.
  • The mode of a list of numbers is the number that occurs most frequently in the list.

    Ex.  The mode of the numbers in the list 1, 3, 6, 4, 3, 5 is 3. A list of numbers may have more than one mode. For example, the list 1, 2, 3, 3, 3, 5, 7, 10, 10, 10, 20 has two modes, 3 and 10.
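
The three measures can be computed with Python's standard `statistics` module, as in the sketch below. It assumes Python 3.8 or later, since `multimode` was added in that version. The lists are the ones used in the examples above; the list `with_outlier` is the same five numbers with 10 replaced by 24.

```python
import statistics

numbers = [6, 4, 7, 10, 4]
mean_value = statistics.mean(numbers)
median_value = statistics.median(numbers)

# Replace 10 by 24: the mean moves, the median does not.
with_outlier = [6, 4, 7, 24, 4]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

# multimode returns every value that occurs most often.
modes_one = statistics.multimode([1, 3, 6, 4, 3, 5])
modes_two = statistics.multimode([1, 2, 3, 3, 3, 5, 7, 10, 10, 10, 20])

print(mean_value, median_value)
print(mean_outlier, median_outlier)
print(modes_one, modes_two)
```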

Numerical Methods for Describing Data - Measures of Position

  • The three most basic positions, or locations, in a list of data ordered from least to greatest are the beginning, the end, and the middle. It is useful here to label these as L for the least, G for the greatest, and M for the median.
  • Aside from these, the most common measures of position are quartiles and percentiles. Like the median M, quartiles and percentiles are numbers that divide the data into roughly equal groups after the data have been ordered from the least value L to the greatest value G. There are three quartile numbers that divide the data into four roughly equal groups, and there are 99 percentile numbers that divide the data into 100 roughly equal groups. As with the mean and median, the quartiles and percentiles may or may not themselves be values in the data.
  • The first quartile Q1, the second quartile Q2 (which is simply the median M), and the third quartile Q3 divide a group of data into four roughly equal groups as follows.
    • the first group consists of the data from L to Q1,
    • the second group is from Q1 to M,
    • the third group is from M to Q3,
    • and the fourth group is from Q3 to G.
  • Because the number of data in a list may not be divisible by 4, there are various rules to determine the exact values of Q1 and Q3 and some statisticians use different rules, but in all cases Q2 = M. We use perhaps the most common rule, in which Q2 = M divides the data into two equal parts (the lesser numbers and the greater numbers), and then Q1 is the median of the lesser numbers and Q3 is the median of the greater numbers.

    Ex.  To find the quartiles for the list of 16 numbers 2, 4, 4, 5, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9 (already listed in order), first divide the data into two groups of 8 numbers each. The first group is 2, 4, 4, 5, 7, 7, 7, 7 and the second group is 7, 7, 8, 8, 9, 9, 9, 9, so that the second quartile, or median, is 7. To find the other quartiles, you can take each of the two smaller groups and find its median: the first quartile is 6 (the average of 5 and 7) and the third quartile is 8.5 (the average of 8 and 9).

    In the above example, note that the number 4 is in the lowest 25 percent of the distribution of data. There are different ways to describe this. We can say that 4 is below the first quartile, that is, below Q1 = 6; we can also say that 4 is in the first quartile. The phrase "in a quartile" refers to being in one of the four groups determined by Q1, Q2, and Q3.
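
The quartile rule described above (Q1 and Q3 as medians of the lesser and greater halves) can be sketched in Python. The list used here is the 16-number list built from the values and weights given in the weighted-mean example, on the assumption that it is the list this example refers to. For a list with an odd number of data, the sketch leaves the middle value out of both halves, which is one assumed convention among several.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]       # the lesser numbers
    upper = ordered[n - half:]   # the greater numbers (middle value skipped if n is odd)
    return median(lower), median(ordered), median(upper)

numbers = [2, 4, 4, 5, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9]
q1, m, q3 = quartiles(numbers)
print(q1, m, q3)
```

Other rules for Q1 and Q3 exist, so results from statistical software may differ slightly from this sketch.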

  • Percentiles are mostly used for very large lists of numerical data ordered from least to greatest. Instead of dividing the data into four groups, the 99 percentiles P1, P2, P3, ..., P99 divide the data into 100 groups. Consequently, Q1 = P25, M = Q2 = P50, and Q3 = P75. Because the number of data in a list may not be divisible by 100, statisticians apply various rules to determine values of percentiles.

Numerical Methods for Describing Data - Measures of Dispersion

  • Measures of dispersion indicate the degree of "spread" of the data. The most common statistics used as measures of dispersion are the range, the interquartile range, and the standard deviation. These statistics measure the spread of the data in different ways.
  • The range of the numbers in a group of data is the difference between the greatest number G in the data and the least number L in the data; that is, G - L. For example, given the list  the range of the numbers is .
  • The simplicity of the range is useful in that it reflects the maximum spread of the data. However, sometimes a data value is so unusually small or so unusually large in comparison with the rest of the data that it is viewed with suspicion when the data are analyzed, since the value could be erroneous or accidental in nature. Such data are called outliers because they lie so far out that in most cases, they are ignored when analyzing the data. Unfortunately, the range is directly affected by outliers.
  • A measure of dispersion that is not affected by outliers is the interquartile range. It is defined as the difference between the third quartile and the first quartile, that is, Q3 - Q1. Thus, the interquartile range measures the spread of the middle half of the data.
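
To see how an outlier affects the two measures differently, consider the following minimal Python sketch. Both lists are hypothetical (made up for illustration), the function names are assumptions, and the quartiles are taken as the medians of the lesser and greater halves of the ordered data.

```python
from statistics import median

def data_range(data):
    """Greatest value minus least value."""
    return max(data) - min(data)

def interquartile_range(data):
    """Q3 - Q1, with Q1 and Q3 as medians of the lesser and greater halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    return median(ordered[n - half:]) - median(ordered[:half])

typical = [3, 5, 6, 6, 7, 8, 9, 10]        # hypothetical data
with_outlier = [3, 5, 6, 6, 7, 8, 9, 100]  # same data, greatest value replaced

print(data_range(typical), interquartile_range(typical))
print(data_range(with_outlier), interquartile_range(with_outlier))
```

In this made-up case the range jumps from 7 to 97 while the interquartile range stays at 3.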

    Ex.  In the list of 16 numbers 2, 4, 4, 5, 7, 7, 7, 7, 7, 7, 8, 8, 9, 9, 9, 9, the range is 9 - 2 = 7, the first quartile is Q1 = 6, and the third quartile is Q3 = 8.5. So the interquartile range for the numbers in this list is 8.5 - 6 = 2.5.

    One way to summarize a group of numerical data and to illustrate its center and spread is to use the five numbers L, Q1, Q2, Q3, and G. These five numbers can be plotted along a number line to show where the four quartile groups lie. Such plots are called boxplots or box-and-whisker plots, because a box is used to identify each of the two middle quartile groups of data, and "whiskers" extend outward from the boxes to the least and greatest values. The following graph shows the boxplot for the list of 16 numbers in this example.

  • There are a few variations in the way boxplots are drawn—the position of the ends of the boxes can vary slightly, and some boxplots identify outliers with certain symbols—but all boxplots show the center of the data at the median and illustrate the spread of the data in each of the four quartile groups.

    Ex.  Two large lists of numerical data, list I and list II, are summarized by the following boxplots.

    Based on the boxplots, several different comparisons of the two lists can be made. First, the median of list II, which is approximately , is greater than the median of list I, which is approximately . Second, the two measures of spread, range and interquartile range, are greater for list I than for list II. For list I, these measures are approximately  and , respectively; and for list II, they are approximately  and , respectively.

  • Unlike the range and the interquartile range, the standard deviation is a measure of spread that depends on each number in the list. Using the mean as the center of the data, the standard deviation takes into account how much each value differs from the mean and then takes a type of average of these differences. As a result, the more the data are spread away from the mean, the greater the standard deviation; and the more the data are clustered around the mean, the lesser the standard deviation.
  • The standard deviation of a group of n numerical data is computed by (1) calculating the mean of the n values, (2) finding the difference between the mean and each of the n values, (3) squaring each of the differences, (4) finding the average of the n squared differences, and (5) taking the nonnegative square root of the average squared difference.

    Ex.  For the five data  and , the standard deviation can be computed as follows. First, the mean of the data is , and the squared differences from the mean are

    or . The average of the five squared differences is 68/5, or 13.6, and the positive square root of 13.6 is approximately 3.7.

  • Note on terminology: The term "standard deviation" defined above is slightly different from another measure of dispersion, the sample standard deviation. The latter term is qualified with the word "sample" and is computed by dividing the sum of the squared differences by n - 1 instead of n. The sample standard deviation is only slightly different from the standard deviation but is preferred for technical reasons for a sample of data that is taken from a larger population of data. Sometimes the standard deviation is called the population standard deviation to help distinguish it from the sample standard deviation.
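
The five steps for the standard deviation, and the difference between dividing by n and by n - 1, can be sketched in Python as follows. The data set below is an assumption: five values chosen so that the average squared difference is 13.6, the figure quoted in the example above. The function name `population_sd` is also made up for illustration; `statistics.pstdev` and `statistics.stdev` are the standard-library functions for the population and sample versions.

```python
import math
import statistics

def population_sd(data):
    """Standard deviation computed by the five steps described above."""
    n = len(data)
    mean = sum(data) / n                  # (1) mean of the n values
    diffs = [x - mean for x in data]      # (2) differences from the mean
    squared = [d ** 2 for d in diffs]     # (3) squared differences
    avg_squared = sum(squared) / n        # (4) average of the squared differences
    return math.sqrt(avg_squared)         # (5) nonnegative square root

data = [0, 7, 8, 10, 10]  # assumed data, chosen to give an average squared difference of 13.6

sd = population_sd(data)
sample_sd = statistics.stdev(data)  # divides by n - 1 instead of n

print(round(sd, 1), round(statistics.pstdev(data), 1), round(sample_sd, 1))
```

With only five values the two versions differ noticeably; the gap shrinks as n grows, because dividing by n - 1 and dividing by n become nearly the same.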

    Ex.  Six hundred applicants for several post office jobs were rated on a scale from 1 to 50 points. The ratings had a mean of 32.5 points and a standard deviation of 7.1 points. How many standard deviations above or below the mean is a rating of 48 points? A rating of 30 points? A rating of 20 points?

    Sol. Let s be the standard deviation, so s = 7.1 points. Note that 1 standard deviation above the mean is

    $32.5 + s = 32.5 + 7.1 = 39.6$

    and 2 standard deviations above the mean is

    $32.5 + 2s = 32.5 + 2(7.1) = 46.7$

    So 48 is a little more than 2 standard deviations above the mean. Since 48 is actually 15.5 points above the mean, the number of standard deviations that 48 is above the mean is 15.5/7.1, or approximately 2.2. Thus, to answer the question, we first found the difference from the mean and then we divided by the standard deviation. The number of standard deviations that a rating of 30 is away from the mean is

    $\frac{30 - 32.5}{7.1} = \frac{-2.5}{7.1} \approx -0.4$

    where the negative sign indicates that the rating is 0.4 standard deviation below the mean.

    The number of standard deviations that a rating of 20 is away from the mean is

    $\frac{20 - 32.5}{7.1} = \frac{-12.5}{7.1} \approx -1.8$

    where the negative sign indicates that the rating is 1.8 standard deviations below the mean.

    To summarize:

    • 48 points is 15.5 points above the mean, or approximately 2.2 standard deviations above the mean.
    • 30 points is 2.5 points below the mean, or approximately 0.4 standard deviation below the mean.
    • 20 points is 12.5 points below the mean, or approximately 1.8 standard deviations below the mean.

    One more instance, which may seem trivial, is important to note:

    • 32.5 points is 0 points from the mean, or 0 standard deviations from the mean.
  • The process of subtracting the mean from each value and then dividing the result by the standard deviation is called standardization. Standardization is a useful tool because for each data value, it provides a measure of position relative to the rest of the data independently of the variable for which the data was collected and the units of the variable.
  • Note that the standardized values 2.2, -0.4, and -1.8 from the above example are all between -3 and 3; that is, the corresponding ratings 48, 30, and 20 are all within 3 standard deviations above or below the mean. This is not surprising, based on the following fact about the standard deviation.

       

In any group of data, most of the data are within about 3 standard deviations above or below the mean.

Thus, when any group of data are standardized, most of the data are transformed to an interval on the number line centered about 0 and extending from about -3 to 3. The mean is always transformed to 0.
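
Standardization can be sketched in Python using the mean and standard deviation from the post office example. The function name `standardize` is an assumption for illustration.

```python
def standardize(value, mean, sd):
    """Number of standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

mean, sd = 32.5, 7.1  # ratings example: mean and standard deviation in points

z_scores = {rating: standardize(rating, mean, sd) for rating in (48, 30, 20, 32.5)}
for rating, z in z_scores.items():
    print(f"rating {rating}: {z:+.1f} standard deviations")
```

Each of the four standardized values falls between -3 and 3, and the mean itself is transformed to 0.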

 
