Test–retest stability of patient experience items derived from the national GP patient survey
© The Author(s) 2016
Received: 30 June 2016
Accepted: 23 September 2016
Published: 7 October 2016
The validity and reliability of various items from the GP Patient Survey (GPPS) have been reported; however, the stability of patient responses over time has not been tested. The purpose of this study was to determine the test–retest reliability of the core items from the GPPS.
Patients who had recently consulted participating GPs in five general practices across South West England were sent a postal questionnaire comprising 54 items concerning their experience of their consultation and the care they received from the GP practice. Patients returning the questionnaire within 3 weeks of mail-out were sent a second, identical (retest) questionnaire. Stability of responses was assessed using raw agreement rates and Cohen’s kappa (for categorical response items) and intraclass correlation coefficients and means (for ordinal response items).
348 of 597 patients returned a retest questionnaire (58.3 % response rate). Compared with the test phase, patients responding to the retest phase were older and more likely to be of white British ethnicity. Raw agreement rates for the 33 categorical items ranged from 66 to 100 % (mean 88 %), while the kappa coefficients ranged from 0.00 to 1.00 (mean 0.53). Intraclass correlation coefficients for the 21 ordinal items averaged 0.67 (range 0.44–0.77).
Formal testing of items from the national GP patient survey examining patient experience in primary care highlighted their acceptable temporal stability several weeks following a GP consultation.
Patient surveys have been adopted widely, both in the UK and elsewhere, as a means of capturing patients’ experience of care delivered in primary and secondary care settings. Information obtained through such surveys offers the potential to inform service development and continuous quality improvement. Although offering such potential, previous research has identified concerns raised by doctors and others concerning the reliability and credibility of survey results (Asprey et al. 2013). A recent study exploring the views of primary care staff around the utility of patient experience surveys highlighted concerns regarding the perceived weakness of survey methods, the robustness of local surveys, and the rigidity of survey methodology in accurately capturing the complexity of health-care interactions (Boiko et al. 2013). A range of primary care patient experience surveys have been published which have been subjected to formal psychometric testing including some where the stability of responses over time have been documented (Wright et al. 2012; Greco et al. 1999; Mead et al. 2008; Greco and Sweeney 2003; Pettersen et al. 2004).
The English GP Patient Survey (GPPS) is a large-scale survey of patient experience of primary care routinely reported at the level of data aggregated by practice. Evidence supporting the validity and reliability of the questionnaire has already been published (Campbell et al. 2009). The questionnaire items address a range of issues relating to the accessibility of care, the quality of interpersonal care and a number of other important domains including an overall impression of patient satisfaction with care.
Although the GPPS is the largest survey of primary care undertaken in England, and having results which directly inform the NHS outcomes framework (Department of Health 2013), the stability of patient responses over time has not been tested or reported. Governance restrictions on the national GP patient survey preclude an evaluation of test–retest stability in the national data. We therefore aimed to explore this important aspect of the performance of the questionnaire using items from the national survey deployed in a postal survey in primary care.
Patients over the age of 18 who had attended a consultation with their general practitioner within the previous 21 days were sent a postal questionnaire.
As part of a larger study examining patients’ reports of their experience of care provided by general practitioners (Roberts et al. 2014), we invited five practices across the South West of England (Bristol, Devon, and Cornwall) to take part in the test phase between November 2011 and June 2013. Non-training grade GPs within participating practices who worked fewer than four sessions a week, locums, and GPs in training were excluded from the study. Approval for the study was obtained from the South West 2 Research Ethics Committee on 28 January 2011 (ref: 09/H0202/65).
Searches were carried out on practice computer systems, generating lists of patients who had face-to-face consultations with participating doctors within a 21 day period prior to the search. Doctors screened their lists to exclude recent deaths, terminal illness, and mental incapacity. Eligible patients were posted a questionnaire pack containing a practice headed invitation letter, study information sheet, questionnaire and a prepaid return envelope. The patient information sheet provided an outline of the study; patients’ consent to participation was inferred by the return of a completed questionnaire.
Our questionnaire (“Appendix”) was based closely on the national GP Patient Survey (GPPS: Year 5 Q4 version). The questionnaire included questions on access, waiting times, opening hours, continuity and interpersonal aspects of care, and basic socio-demographic questions, such as age, gender and ethnicity. All the questions apart from the interpersonal and continuity items were identical to the GPPS (Roberts et al. 2014). As the primary aim of the main study was to focus on patient assessment of individual doctors’ communication skills, patients were asked to complete seven items relating to interpersonal aspects of doctors’ care, and items relating to continuity of care, with reference to a consultation with a named GP on a specific date, as stated in a covering letter (Roberts et al. 2014). This context differed from that of the national GPPS, in which patients are asked to complete these items in relation to all consultations that have occurred over the past 6 months, rather than a specific consultation with a named GP. Our questionnaire contained 38 closed questions, 13 of which covered the socio-demographic profile of the patient and were not analysed in this study. The 25 questions covering patients’ experience of care at the practice comprised 54 separate response items relating to making an appointment (12), telephone access (4), access to a doctor (11), arriving at the appointment (6), continuity of care (3), opening hours (8), doctor-patient communication and trust (8), and overall satisfaction (2). Response options were categorical for 33 of these items and ordinal for the remaining 21.
Sample size calculation
Calculations based on Fisher’s z-transformation of the intraclass correlation coefficient (ICC) showed that a sample of 250 completed retest questionnaires would allow us to estimate ICCs for individual items with a 95 % margin of error of less than 0.1 for coefficients around 0.5 and less than 0.05 for coefficients of 0.8 or above. Based on response rates in initial pilot data of 38 questionnaires per doctor returned within 3 weeks, and an estimated retest-phase response rate of 75 % observed in an earlier study (Wright et al. 2012), we sought to recruit patients from a minimum of nine doctors.
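The paper does not give the formula behind this calculation. A common approximation, which reproduces the quoted margins of error, treats the test–retest ICC like a correlation coefficient, so that Fisher’s z has standard error roughly 1/√(n − 3). Under that assumption (the function name below is ours, for illustration only), the calculation can be sketched as:

```python
import math

def icc_margin_of_error(icc, n, z_crit=1.96):
    """Approximate 95 % margin of error (CI half-width) for an ICC
    estimated from n test-retest pairs, treating the ICC like a
    correlation coefficient: SE of Fisher's z is ~ 1/sqrt(n - 3)."""
    z = 0.5 * math.log((1 + icc) / (1 - icc))  # Fisher z-transformation
    se = 1 / math.sqrt(n - 3)
    lower = math.tanh(z - z_crit * se)         # back-transform CI limits
    upper = math.tanh(z + z_crit * se)
    return (upper - lower) / 2

# With 250 completed retest questionnaires:
print(round(icc_margin_of_error(0.5, 250), 3))  # 0.093 (< 0.1)
print(round(icc_margin_of_error(0.8, 250), 3))  # 0.045 (< 0.05)
```

Note that the margin of error shrinks as the ICC grows, because the Fisher transformation stretches the scale near 1; this is why the same sample supports a tighter bound for coefficients of 0.8 than for 0.5.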
Data analysis was conducted using SPSS version 18 (SPSS 2009). We described the response rate and response timings for both test and retest phases and compared the demographic profiles (age, gender, ethnicity) of three groups of patients: those who were sent but did not return a test questionnaire within 3 weeks of mail out (and so were not eligible for the retest phase), those who were sent but did not return a retest questionnaire within 4 weeks of mail out, and those who returned both a test and a retest questionnaire within the deadlines. The proportions of non-response by patients eligible to answer each of the 54 separate response items were compared between the test and retest phases using Chi squared tests with a Holm-Bonferroni correction for multiple comparisons (Holm 1979). For the 33 categorical response items we measured test–retest reliability using raw agreement rates and Cohen’s kappa statistic (Cohen 1968). For the 21 ordinal response items we assigned integer scores (1, 2, 3, etc.) to the meaningful response options (excluding any ‘Don’t know’ or ‘Not applicable’ options) and calculated ICCs. Both the ICCs and the kappa statistics were interpreted as follows: <0.00 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect (Landis and Koch 1977). We calculated the mean score on each item in the test and retest phases and investigated possible changes in the mean scores using paired-sample t-tests, again with a Holm-Bonferroni correction.
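To make the two agreement measures used for the categorical items concrete, the following sketch computes the raw agreement rate and unweighted Cohen’s kappa for a hypothetical dichotomous item (the data and function names are invented for illustration; they do not come from the study):

```python
from collections import Counter

def raw_agreement(test, retest):
    """Proportion of patients giving the same response at test and retest."""
    return sum(a == b for a, b in zip(test, retest)) / len(test)

def cohens_kappa(test, retest):
    """Unweighted Cohen's kappa: agreement between the two administrations
    of a categorical item, corrected for chance agreement."""
    n = len(test)
    p_obs = raw_agreement(test, retest)
    c1, c2 = Counter(test), Counter(retest)
    # Chance agreement expected if the two administrations were independent
    p_exp = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical yes/no item answered by ten patients at test and retest
test_resp   = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
retest_resp = ["yes", "yes", "no", "no",  "no", "yes", "yes", "no", "yes", "yes"]
print(raw_agreement(test_resp, retest_resp))          # 0.9
print(round(cohens_kappa(test_resp, retest_resp), 2)) # 0.78
```

Kappa is lower than raw agreement because it discounts the matches that two independent administrations would produce by chance given the marginal response frequencies.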
Demographic characteristics of patient sample by level of study participation with P value for tests of variation across the three groups
Patients sent but not returning a test questionnaire within 3 weeks of mail out.
Patients sent but not returning a retest questionnaire within 4 weeks of mail out.
Patients returning both a test and a retest questionnaire within the deadlines.
Number (%) male
Number (%) white British
Mean (SD) age in years
No significant differences in item non-response rates between the test and retest phase were found for any of the 54 items.
Test–retest reliability of categorical items
Sample size, raw agreement (%) and Cohen’s kappa statistic for the 33 categorical items
Raw agreement (%)
Making an appointment
Q1a Normally book an appointment in person
Q1b Normally book an appointment by phone
Q1c Normally book an appointment by fax
Q1d Normally book an appointment online
Q1e Normally book an appointment by digital TV
Q1f Booking doesn’t apply
Q2a Prefer to book in person
Q2b Prefer to book by phone
Q2c Prefer to book by fax
Q2d Prefer to book online
Q2e Prefer to book by digital TV
Q2f No preference in booking an appointment
Access to a doctor
Q4 In the past 6 months, have you tried to see the doctor quickly
Q5 Were you able to see the doctor quickly
Q6a If you couldn’t be seen quickly was this because there were no appointments
Q6b If you couldn’t be seen quickly was this because the times did not suit you
Q6c If you couldn’t be seen quickly was this because the appointment was with a doctor you didn’t want to see
Q6d If you couldn’t be seen quickly was this because the appointment offered was with a nurse and you wanted to see a doctor
Q6e If you couldn’t be seen quickly was this because you were offered an appointment at a different branch
Q6f If you couldn’t be seen quickly was this because there was a different reason
Q6g Can’t remember why you were unable to be seen quickly
Q7 In the past 6 months, have you tried to book ahead for an appointment with a doctor
Q8 Were you able to get an appointment with a doctor more than 2 weekdays ahead
Arriving at the appointment
Q11 In the reception area, can other patients overhear what you say to the receptionist
Continuity of care
Q15 Is there a particular doctor you prefer to see
Q17 Was your consultation with your preferred doctor
Q19a As far as you know is the surgery open before 0800
Q19b As far as you know is the surgery open at lunchtime
Q19c As far as you know is the surgery open after 1830
Q19d As far as you know is the surgery open on Saturdays
Q19e As far as you know is the surgery open on Sundays
Q20 Would you like the surgery to be open at additional times
Q21 Which additional time would you most like your surgery to be open
Test–retest reliability of ordinal response items
Sample size, ICC (95 % confidence interval), mean test–retest difference (95 % confidence interval) and associated P value for the 21 ordinal response items
Q3a How easy have you found getting through on the phone
ICC 0.73 (0.67, 0.78); mean difference −2.40 (−4.91, 0.11)
Q3b How easy have you found speaking to a doctor on the phone
ICC 0.68 (0.59, 0.75); mean difference −4.01 (−7.64, −0.39)
Q3c How easy have you found speaking to a nurse on the phone
ICC 0.63 (0.48, 0.75); mean difference −2.85 (−8.62, 2.93)
Q3d How easy have you found getting test results on the phone
ICC 0.62 (0.51, 0.72); mean difference 0.25 (−3.88, 4.39)
Arriving at the appointment
Q9 How easy do you find it to get into the building at this GP surgery or health centre?
ICC 0.44 (0.35, 0.52); mean difference 2.32 (0.94, 3.70)
Q10 How clean is this GP surgery or health centre?
ICC 0.60 (0.53, 0.66); mean difference 1.16 (−0.10, 2.42)
Q12 How helpful do you find the receptionists at this GP surgery or health centre?
ICC 0.69 (0.63, 0.74); mean difference −0.60 (−2.39, 1.20)
Q13 How long after your appointment time do you normally wait to be seen?
ICC 0.67 (0.60, 0.73); mean difference −0.95 (−2.60, 0.70)
Q14 How do you feel about how long you normally have to wait
ICC 0.70 (0.64, 0.75); mean difference −2.11 (−4.43, 0.21)
Continuity of care
Q16 How often do you see the doctor you prefer
ICC 0.71 (0.64, 0.77); mean difference −0.78 (−3.49, 1.92)
Q18 How satisfied are you with the hours that this GP surgery or health centre is open?
ICC 0.65 (0.59, 0.71); mean difference 2.23 (0.40, 4.06)
Doctor-patient communication and trust
Q22a How good was the doctor at giving you enough time
ICC 0.62 (0.55, 0.68); mean difference 0.45 (−0.96, 1.85)
Q22b How good was the doctor at asking about your symptoms
ICC 0.70 (0.64, 0.75); mean difference −0.47 (−1.84, 0.90)
Q22c How good was the doctor at listening to you
ICC 0.72 (0.66, 0.77); mean difference 0.38 (−0.88, 1.63)
Q22d How good was the doctor at explaining tests and treatments
ICC 0.72 (0.65, 0.77); mean difference −1.27 (−2.81, 0.26)
Q22e How good was the doctor at involving you in decisions about your care
ICC 0.68 (0.61, 0.73); mean difference −1.00 (−2.65, 0.65)
Q22f How good was the doctor at treating you with care and concern
ICC 0.67 (0.61, 0.73); mean difference 0.23 (−1.16, 1.62)
Q22g How good was the doctor at taking your problems seriously
ICC 0.72 (0.67, 0.77); mean difference −0.08 (−1.46, 1.31)
Q23 Did you have confidence and trust in the doctor you saw
ICC 0.70 (0.64, 0.75); mean difference −0.15 (−1.86, 1.57)
Q24 In general how satisfied are you with the care you get at this surgery or health centre?
ICC 0.74 (0.69, 0.78); mean difference −0.58 (−1.81, 0.65)
Q25 Would you recommend this GP surgery or health centre to someone who has just moved to your local area?
ICC 0.77 (0.73, 0.81); mean difference 0.00 (−1.51, 1.51)
Overall, the test–retest reliability of the survey items varied: some items showed relatively poor agreement over the interval of approximately 4 weeks between test and retest questionnaires (ease of access to the building), whilst others achieved excellent agreement (patients’ willingness to recommend their practice to someone who has just moved to the area).
A large majority of the ordinal items had high ICC values and high response agreements. Items relating to staff performance (such as helpfulness of receptionists, communication skills of GPs) achieved high stability over time. One possible reason for this might be that face to face interaction with staff, whether it is with receptionists or with health professionals within a consultation, has a lasting impact on a patient’s memory when compared with other experiences of the practice.
The categorical items achieved fair to almost perfect agreement, with the majority of items demonstrating moderate to substantial agreement. Given that the kappa statistic is adversely affected by prevalence rates for dichotomous items (Feinstein and Cicchetti 1990), there is, we believe, a good case for giving greater weight to the raw agreement rates, which were observed to be high. Despite the mixed kappa scores, the majority of the percentage agreements between the test and retest phases indicated good to excellent reliability.
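The prevalence effect described by Feinstein and Cicchetti can be illustrated numerically: when nearly all patients give the same answer, the chance-expected agreement is already very high, so even very high raw agreement can yield a near-zero or negative kappa. A minimal sketch with hypothetical counts (not data from this study):

```python
def kappa_2x2(both_yes, yes_no, no_yes, both_no):
    """Cohen's kappa for a dichotomous item from a 2x2 test-retest table."""
    n = both_yes + yes_no + no_yes + both_no
    p_obs = (both_yes + both_no) / n                    # raw agreement
    yes1, yes2 = both_yes + yes_no, both_yes + no_yes   # marginal 'yes' counts
    p_exp = (yes1 * yes2 + (n - yes1) * (n - yes2)) / n ** 2
    return (p_obs - p_exp) / (1 - p_exp)

# 90 of 100 patients answer 'yes' both times; the other 10 flip their answer
print(round(kappa_2x2(90, 5, 5, 0), 3))  # -0.053, despite 90 % raw agreement
```

With 95 % of responses at test and retest being ‘yes’, chance agreement is 90.5 %, so the observed 90 % agreement falls marginally below chance and kappa turns negative; this is the paradox that motivates weighting the raw agreement rates more heavily.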
Strengths and limitations
This is the first study to report on the stability of patient responses over time using items from the GP patient survey. The response rate for our retest phase was good, and similar to that observed in other test–retest exit surveys in primary care (Wright et al. 2012; Mead et al. 2008). Our sample was not fully representative of the wider patient population within England and Wales; a more diverse test–retest study involving patients from different ethnic and age groups is therefore needed to fully understand the stability of responses from different patient groups over time. There are methodological limitations to the test–retest method. Karras (1997) suggested that one disadvantage of the test–retest method is that the first administration of the questionnaire could influence the results of the second, in that the respondent could recall the answers provided at the test phase and replicate them at the retest phase (Karras 1997). Kimberlin and Winterstein highlight the trade-off between potential recall bias when the retest interval is short and the possibility that what is being measured will have changed when the retest interval is long (Kimberlin and Winterstein 2008). In addition, patients could have attended further consultations with a doctor in the period between the test and retest phases, which could have altered their responses on the retest questionnaire. We did not ask, or identify from practice records, whether or not respondents had visited their practice since the appointment specified in the covering letter accompanying both the test and retest questionnaires; this should be a consideration for future research. Because the design of the study focused on specific consultations with named GPs, we were unable to replicate the exact timings at which the national GPPS questionnaires are distributed to patients following their consultations.
Comparison with existing literature
Previous work has suggested that the timing of administering a questionnaire may have an impact on the patient’s reported satisfaction with a service (Crow et al. 2002; Allemann Iseli et al. 2014; Kong et al. 2007). Research addressing the issue of recall bias suggests that health status should be measured at short intervals, for example 1–2 weeks (Kimberlin and Winterstein 2008; Patrick 1991). If the focus of the research is the recall of specific events, the time frame must be of short duration and in the immediate past. Our study adopted a short interval, referring, as it did, to a consultation which had taken place within the past 3 weeks. A longer time interval between the patient’s consultation and receipt of a questionnaire may influence their recollection of the consultation (Selic et al. 2011; Sandberg et al. 2008). This brings into question the accuracy of reflections regarding a consultation which may be temporally remote. For example, the GP patient survey invites patients to comment on consultations which may have taken place up to 6 months previously.
Whilst some patient surveys used in the UK have been validated and psychometrically tested, test–retest reliability has not been reported for all surveys (Lockyer 2009). The questionnaire used in our study measured patients’ overall experience of their GP surgery, incorporating items addressing GP communication skills, practice environment, access and overall satisfaction. The only questionnaires used widely within the UK and addressing a similar agenda are the General Practice Assessment Survey (GPAS) and Questionnaire (GPAQ) and the Improving Practice Questionnaire (IPQ), none of which has the breadth of topic coverage seen in our survey. Test–retest reliability for GPAS was assessed when patients were asked to complete their test questionnaire at the practice following a consultation, with the retest questionnaire posted to them 1 week later; however, the sample size used was considerably smaller than that of our study (Ramsay et al. 2000).
This is the first test–retest study carried out on items derived from the national GP patient survey. Testing the stability of these particular GPPS items is important if GPs and policy makers wish to assess how patient experience of primary care services has changed over time. Our findings indicate that most of the items considered within this survey have acceptable test–retest reliability across a short time interval, with items relating to staff achieving high reliability. The findings raise some concerns regarding the reliability of certain items over the time frame tested, and have implications for the need to test the reliability of item responses over the longer time interval used in the national GPPS. Further research might usefully explore the performance of the national survey in more diverse samples across England and Wales, and across the longer time interval it encompasses.
All authors contributed to the design of the study. AD conducted the data collection and prepared the manuscript. IM assisted in the data collection. LM and MR provided a detailed analysis of the dataset. All authors provided critical comments on the manuscript. All authors read and approved the final manuscript.
The authors gratefully acknowledge patients, doctors and other staff in the GP surgeries who took part in this study.
Funding was provided by Health Services and Delivery Research Programme (Grant No. RP-PG-0608-10050).
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Allemann Iseli M, Kunz R, Blozik E (2014) Instruments to assess patient satisfaction after teleconsultation and triage: a systematic review. Patient Prefer Adherence 8:893–907. doi:10.2147/PPA.S56160
- Asprey A, Campbell JL, Newbould J, Cohn S, Carter M, Davey A, Roland M (2013) Challenges to the credibility of patient feedback in primary healthcare settings: a qualitative study. Br J Gen Pract. doi:10.3399/bjgp13X664252
- Boiko O, Campbell JL, Elmore N, Davey AF, Roland M, Burt J (2013) The role of patient experience surveys in quality assurance and improvement: a focus group study in English general practice. Health Expect. doi:10.1111/hex.12298
- Campbell JL, Smith P, Nissen S, Bower P, Roland M (2009) The GP Patient Survey for use in primary care in the National Health Service in the UK—development and psychometric characteristics. BMC Fam Pract. doi:10.1186/1471-2296-10-57
- Cohen J (1968) Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 70(4):213–220
- Crow R, Gage H, Hampson S et al (2002) The measurement of satisfaction with healthcare: implications for practice from a systematic review of the literature. Health Technol Assess 6(32):1–244
- Department of Health (2013) The NHS outcomes framework 2014/15. Department of Health, London
- Deyo RA, Diehr P, Patrick DL (1991) Reproducibility and responsiveness of health status measures. Statistics and strategies for evaluation. Control Clin Trials 12(4 Suppl):142S–158S
- Feinstein AR, Cicchetti DV (1990) High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 43(6):543–549
- Greco MC, Brownlea A, McGovern J (1999) Validation studies of the doctors’ interpersonal skills questionnaire. Educ Prim Care 10:256–264
- Greco MC, Sweeney K (2003) The Improving Practice Questionnaire (IPQ), a practical tool for general practices seeking patient views. Educ Prim Care 14(4):440–448
- Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat. doi:10.2307/4615733
- Karras DJ (1997) Statistical methodology: II. Reliability and validity assessment in study design, Part A. Acad Emerg Med 4(1):64–71
- Kimberlin CL, Winterstein AG (2008) Validity and reliability of measurement instruments used in research. Am J Health Syst Pharm 65:2276–2284
- Kong MC, Camacho FT, Feldman SR et al (2007) Correlates of patient satisfaction with physician visit: differences between elderly and non-elderly survey respondents. Health Qual Life Outcomes. doi:10.1186/1477-7525-5-62
- Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
- Lockyer JF (2009) Comparison of patient satisfaction instruments designed for GPs in the UK. Royal College of General Practitioners
- Mead N, Bower P, Roland M (2008) The general practice assessment questionnaire (GPAQ)—development and psychometric characteristics. BMC Fam Pract. doi:10.1186/1471-2296-9-13
- Pettersen KI, Veenstra M, Guldvog B, Kolstad A (2004) The patient experiences questionnaire: development, validity and reliability. Int J Qual Health Care 16(6):453–463
- Ramsay J, Campbell JL, Schroter S et al (2000) The General Practice Assessment Survey (GPAS): tests of data quality and measurement properties. Fam Pract 17(5):372–379
- Roberts MJ, Campbell JL, Abel GA, Davey AF, Elmore N et al (2014) Understanding high and low patient experience scores in primary care: analysis of patients’ survey data for general practices and individual doctors. Br Med J. doi:10.1136/bmj.g6034
- Sandberg EH, Sharma R, Wiklund R, Sandberg WS (2008) Clinicians consistently exceed a typical person’s short-term memory during preoperative teaching. Anesth Analg 107(3):972–978. doi:10.1213/ane.0b013e31817eea85
- Selic P, Svab I, Repolusk M, Gucek NK (2011) What factors affect patients’ recall of general practitioners’ advice? BMC Fam Pract. doi:10.1186/1471-2296-12-141
- SPSS Inc (2009) PASW Statistics for Windows, version 18. SPSS Inc, Chicago
- Wright C, Richards SH, Hill JJ, Roberts MJ, Norman GR et al (2012) Multisource feedback in evaluating the performance of doctors: the example of the UK General Medical Council patient and colleague questionnaires. Acad Med 87(12):1668–1678. doi:10.1097/ACM.0b013e3182724cc0