Apr 07 2014

Last week Jeffrey Henning gave a great #NewMR lecture on how to improve the representativeness of online surveys (click here to access the slides and recordings). During the lecture he touched lightly on the topic of calculating sampling error from non-probability samples, pointing out that it did not really do what it was supposed to. In this blog I want to highlight why I recommend using this statistic as a measure of reliability, but not validity.

If we calculate the sampling error for a non-probability sample, for example from an online access panel, we are not representing the wider population. The population for this calculation is just those people who might have taken the survey: for example, those members of the online access panel who met the screening criteria and who were willing (during the survey period) to take the study. The sampling error tells us how good our estimates of this population are (i.e. those members of the panel who met the criteria and who were willing to take a survey at that particular time).

If we take a sample of 1000 people from an online access panel and we calculate that the confidence interval is +/-3% at the 95% level, what we are saying is that if we had done another test, on the same day, with the same panel, with a different group of people, we are 95% sure that the answer we would have got would have been within 3% of the first test. That is a measure of reliability. But we are not saying that if we had measured the wider population the answer would have been within 3%, or 10% or any other number we could quote.
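As a rough sketch of where a figure like +/-3% comes from, the standard normal-approximation formula for a proportion can be applied to a sample of 1000. The function name `margin_of_error` is hypothetical, and the calculation assumes a simple random draw from the panel, the worst-case proportion of 50%, and a z-value of 1.96 for the 95% level.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion.

    Assumes a simple random sample of size n from the panel (an assumption,
    not something the panel guarantees); p=0.5 gives the widest interval.
    """
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(1000)
print(f"+/-{moe * 100:.1f}%")  # prints +/-3.1%
```

The result describes only how much a repeat draw from the same panel is likely to vary. It says nothing about how far the panel sits from the wider population.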

The sampling error statistic from a panel is not about validity, since we can’t estimate how representative the panel is of the wider population. But, it does give us a statistical measure of how likely we are to get the same answer again if we repeat the study on the same panel, with the same sample specification, during the same period of time – which is a pretty good statement of reliability.

Note that, to researchers, reliability is about whether something measures the same way each time, while validity relates to whether what is measured is correct. A metal metre ruler that is 10 cm short is reliable, because it is always 10 cm short, but it is not as valid as we would like.

My recommendation is to calculate the sampling error and use it to indicate which values from the non-probability sample are at least big enough to be reliable. But let’s not claim it represents the sampling error of the wider population, nor that it directly links to validity.

I would recommend adding text something like: “The sampling reliability of this estimate at the 95% level is +/- X%, which means that if we used the same sampling source 20 times, with the same specification, we would expect the answers to be within X% 19 times.”

Total Survey Error
Another reason to be careful with sampling error is that it is only one source of error in a survey. Asking leading questions, asking questions that people can’t answer (for example because we are poor witnesses to our own plans and motivations), or asking questions that people don’t want to answer (for example because of social desirability bias), can all result in much bigger problems than sampling error.

Researchers can sometimes be too worried about sampling error, leading them to ignore much bigger sources of error in their work.


Jun 29 2013

The ITU (the International Telecommunication Union, the UN agency that looks after ICT – information and communication technologies) has produced a useful update on ICT facts and figures.

The report is well worth reading and shows, amongst other things:

  • As more and more mobile phones are bought, the growth is slowing. In 2005/6 the global growth rate in cellular subscriptions was just under 25%. In 2012/13 it was down to just over 5%. In the developing world the growth has fallen from over 30% in 2005/6 to just over 6% now. None of which is surprising, but it is nice to know the numbers.
  • The internet continues to grow in all regions and globally, with 77% in the developed world having internet access, and 31% in the developing world.
  • Globally just under 3 billion people are using the internet, almost 40% of the population.
  • About 50% of the households with access to the internet are in the developing world (although that is a much lower penetration rate than in the developed world, 28% in the developing world and 78% in the developed world).
  • Fixed-broadband is much cheaper in the developed world than the developing world, although the price has been falling in the developing world. Costs in the report are measured as a percentage of GNI (Gross National Income – roughly, the amount the whole country earns) per person. In the developed world fixed broadband costs under 2% of average monthly income, whereas in the developing world it costs over 30% of average monthly income.
  • Fixed broadband in the developing world is growing, but is still only 6%, compared with 27% in the developed markets. However, over 50% of the households with fixed-broadband are in the developing world, because it is larger.
  • The four countries with the highest percentage of their fixed-broadband being high-speed are: South Korea, Hong Kong, Japan, and Bulgaria.
  • Mobile broadband subscriptions have grown from under 300 million in 2007, to 2 billion in 2013.
  • In the developing countries mobile broadband is more expensive than it is in the developed markets, but cheaper than fixed broadband in the developing markets.
  • In Africa mobile broadband subscriptions cost about 50% of average income, compared with less than 2% in Europe.

The ITU is 100% wrong on penetration
So, it is a pity that the ITU refers to a highly misleading statistic in its report, one that undermines the way data from the ITU will be regarded. And it is a pity that some people in and around the market research world have picked up on this misleading number.

What is this misleading statistic? I am referring to the part of the report where the ITU says that the penetration of mobile-cellular is 96% globally and approaching 100%. It then compounds its dodgy use of language when it describes the penetration in the developed world as 128%, and describes mobile-cellular penetration as 170% in the CIS (a subset of the countries that used to be in the Soviet Union, including Russia).

Let’s just think about 100% for one moment. In the way we normally use the phrase (for products, diseases, education, services) 100% would mean every baby, every prisoner, every homeless person would have one. For example, when we estimate the penetration of a TV show we interview a representative sample and gross up to the population. Clearly, it would be a nonsense to claim that 100% of people have a mobile phone. By the time we get to 170%, we can see that the ‘normal’, or useful definition of penetration is not the one they are using.

So, what do the reports of 100% penetration mean? Read the non-nonsense bits of the ITU report and you will notice that the team who have produced the charts (as opposed to the copy) refer to mobile-cellular subscriptions, and mobile-cellular subscriptions per 100 people. It is a pity that the copywriters did not follow the lead of the ITU people who worked on the charts.

What are mobile-cellular subscriptions? Very roughly, the number of subscriptions is the number of sims in use. If somebody has two phones, that is two sims, two subscriptions. If somebody has a dual-sim phone, that is two sims, and is often two subscriptions. If somebody has two phones, a tablet, and a mobile modem, they have four sims.
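A small sketch can show how subscriptions per 100 people and the share of people with a phone come apart. The SIM counts below are invented purely for illustration (they are not ITU data), and the variable names are hypothetical.

```python
# Hypothetical SIM counts for ten people (illustrative only, not ITU data).
# Two people have no SIM; one has two phones, a tablet, and a mobile modem.
sims_per_person = [0, 0, 1, 1, 1, 1, 2, 2, 2, 4]

people = len(sims_per_person)
subscriptions = sum(sims_per_person)

# What the ITU charts report: subscriptions per 100 people.
subscriptions_per_100 = 100 * subscriptions / people

# Penetration in the usual sense: the share of people with at least one SIM.
penetration = 100 * sum(1 for s in sims_per_person if s > 0) / people

print(f"Subscriptions per 100 people: {subscriptions_per_100:.0f}")  # prints 140
print(f"People with at least one SIM: {penetration:.0f}%")           # prints 80%
```

In this made-up group there are 140 subscriptions per 100 people, yet one person in five has no mobile connection at all.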

Am I just being pedantic, or does it matter? Yes, in my opinion it matters. Because people are quoting these super high ‘penetration’ rates there is an assumption that catering for mobile phone users, in and of itself, avoids excluding people. We can use the UK as a good example. The ITU figures for the UK, in 2011, say there were 131 subscriptions per 100 people – a figure the ITU copywriters and careless MR tweeters would call 131% penetration. However, the UK’s General Lifestyle Survey found that in 2011 one-in-seven households had zero mobile phones (i.e. 86% of households had at least one person in them who had at least one mobile phone). Data collected in the UK by the communication regulator (Ofcom) estimate that at the end of 2012 92% of adults owned or had the use of a mobile phone.

In the developed markets, such as the UK, the difference between a penetration rate of 131-132% of the total population (babies and all) and a real rate of 86-92% of adults is not particularly important. But if the ratio in the UK is typical, the ITU figure of 100% global could mean about two-thirds of adults have the use of a mobile phone, and that does matter. For example, it means research projects requiring a good representation of people, in some countries, cannot assume that mobile is currently a safe option.
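The 'about two-thirds' figure can be reproduced with a quick back-of-envelope calculation. This is only a sketch: it assumes the UK ratio of real usage to subscriptions carries over to the global figure, which is the 'if the ratio in the UK is typical' condition, and the function name `implied_rate` is hypothetical.

```python
UK_SUBSCRIPTIONS_PER_100 = 131  # ITU figure for the UK, 2011
UK_RATE_LOW = 86                # General Lifestyle Survey 2011: % of households with a mobile
UK_RATE_HIGH = 92               # Ofcom, end of 2012: % of adults with the use of a mobile

def implied_rate(subscriptions_per_100, uk_rate):
    """Scale a subscriptions-per-100 figure by the UK ratio of real usage to subscriptions."""
    return subscriptions_per_100 * uk_rate / UK_SUBSCRIPTIONS_PER_100

low = implied_rate(100, UK_RATE_LOW)
high = implied_rate(100, UK_RATE_HIGH)
print(f"Implied real rate: about {low:.0f}% to {high:.0f}%")  # prints about 66% to 70%
```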

Dec 18 2012

I am in the process of writing an introductory statistics book for market researchers. This post and some of the following posts are taken from that book, in an attempt to field test the style, approach, and depth I am employing. All comments welcome.

My recommendation is that most numbers in presentations and reports should be presented as 2 or 3 significant digits. I feel that the issue of significant digits is more important than the more frequently discussed issue of decimal places.

In a number, the significant digits are those that carry the key details. If a bank robber steals $56 million, the 5 and the 6 are the significant digits – and the million gives the scale of the number. If we say that pi is 3.1416 then we are showing it to four decimal places and five significant digits.
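For readers who want to try this out, here is a minimal sketch of rounding to a chosen number of significant digits. The helper `round_sig` is a hypothetical name, and the figure of 56,400,000 is an invented amount used only to illustrate the bank robber example.

```python
import math

def round_sig(x, digits=2):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    # Position of the leading digit: 0 for 3.14, 7 for 56,400,000.
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)

print(round_sig(math.pi, 5))     # prints 3.1416
print(round_sig(56_400_000, 2))  # prints 56000000
```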

Table 1 shows the number of internet users in five key, original, members of the EU; showing the raw numbers and the same numbers using two significant digits.

Column B shows the estimates in the format they were downloaded from the InternetWorldStats website. These raw numbers contain 7 or 8 digits, and commas are used to help make the numbers more readable. These values, presumably, represent the best estimates for each country, but they require an active act to read and interpret. By contrast, Column C shows the numbers using just two significant digits.

The use of two significant digits in Column C has two advantages, when compared with Column B.

  1. It is much easier to see the relationships in Column C, compared with Column B. For example, in Column C, it is easy to see that Italy has just over twice as many internet users as the Netherlands, and about half as many as Germany. This information is harder to see at a glance in Column B.
  2. Almost all numbers have errors in them, and they tend to relate to a specific moment in time. Statisticians talk of spurious accuracy when too many digits are displayed, for example when saying 37.67% plus or minus 10%. If we use all of the digits, as in Column B, then we are implying (to most readers) that all the digits are equally accurate. By using just the two most significant digits, Column C gives a message to the reader that these are approximations.

Methods of utilising 2 or 3 significant digits
Here are some tips for different situations:
  1. Percentages. Only use round numbers, e.g. 36% rather than 35.67%.
  2. Salaries. Round them to the nearest thousands, for example $136K, rather than $135,670.
  3. 7-point rating scales. One decimal place, for example 4.6 rather than 4.634.
  4. Sales. Round the numbers to the nearest thousands, millions, or billions. For example, numbers like 36,785 and 76,230 could be expressed as 37K and 76K (two significant digits). However, 36,785, 76,230 and 148,102 would need to be shown as 37K, 76K, and 148K (three significant digits).
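The tips above can be sketched as a few small formatting helpers. The function names are hypothetical, and the inputs are the example values from the list. Note that Python's built-in `round` sends exact halves to the nearest even number, which does not affect these examples.

```python
def format_percent(value):
    """Show a percentage as a round number, e.g. 35.67 -> '36%'."""
    return f"{round(value)}%"

def format_thousands(value):
    """Round to the nearest thousand and show with a K suffix, e.g. 135670 -> '136K'."""
    return f"{round(value / 1000)}K"

def format_rating(value):
    """Show a 7-point rating scale mean to one decimal place, e.g. 4.634 -> '4.6'."""
    return f"{round(value, 1):.1f}"

print(format_percent(35.67))                                       # prints 36%
print("$" + format_thousands(135_670))                             # prints $136K
print(format_rating(4.634))                                        # prints 4.6
print([format_thousands(v) for v in (36_785, 76_230, 148_102)])    # prints ['37K', '76K', '148K']
```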

Ralph Waldo Emerson said “A foolish consistency is the hobgoblin of little minds”, and it would be foolish to think that every set of numbers can be shown to two or three significant digits. Background documents, notes, and tables are often better with more digits.

However, in most cases, and in most presentations and reports, two or three significant digits are going to help the audience/reader understand the message better than showering them with digits.