Data management


Overview

Data management is an ongoing task in an outbreak investigation. It covers the collection, storage, and security of information. Attention to the details of data collection, including tool creation and data entry, supports valid and reliable results on which to base conclusions and recommendations. A clear data management plan facilitates communication and coordination throughout the investigation and is a key part of both descriptive and analytic studies.


Software options

There is a wide range of software options for data management. The following table summarizes the most common software programs and how they perform for three key functions:

  1. Data entry
  2. Basic data analysis
  3. Advanced data analysis

Platform | Data entry | Basic data analysis | Advanced data analysis
Microsoft Excel | Problematic | Yes | No
Microsoft Access | Yes (multiple users) | Limited | No
EpiData | Yes | Yes | Yes (limited)
Epi Info 7 | Yes | Yes (limited) | Yes (limited)
SAS | n/a | Yes | Yes
Stata | n/a | Yes | Yes
SPSS | n/a | Yes | Yes
Online Survey Platforms (e.g., FluidSurveys) | Yes (multiple users) | Yes | No

Additional points to consider:

  • Cost
  • Graphics capabilities
  • Speed on network (if applicable)
  • Ability to modify
  • Current skill level of employees
  • User support


General guidelines and considerations

Planning

  • Design data collection tools (e.g., questionnaires) with both a data management plan and a data analysis plan in mind.
  • Keep a paper or electronic log book of all data management decisions made during the investigation.
  • Create a communication and information dissemination plan ahead of time to facilitate information sharing and flow under outbreak conditions.

Design considerations

  • The format of the data collection tool/questionnaire (paper vs. electronic entry form) and the questions themselves (open-ended text vs. forced choice) affect, and may limit, the types of analyses that can be performed and the conclusions that can be drawn.
  • To maintain data integrity, be aware of how blank rows and columns, missing values, column headings or variable names, and particular field types (e.g., date fields) are handled when moving data between software products.
  • Ensure that each variable measures, quantifies, or qualifies only one specific thing (as opposed to multiple).
  • Clarify how many and which characters are allowed for variable names and all fields in the software product(s) you are using.
  • Name variables in data collection tools with short and descriptive names that make sense to you and are more likely to make sense to others (e.g., id, sex, onsetdate, and serotype and not VAR1, VAR2, VAR3 and VAR4, respectively).
  • Allow space for a unique questionnaire ID number or pre-assign ID numbers.
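
Several of the design points above can be checked programmatically before data move between software products. The following is a minimal Python sketch, not a prescribed method. The naming rule (a letter followed by letters, digits, or underscores, 10 characters at most), the function names, and the sample data are all assumptions made for illustration; substitute the limits of the software you actually use.

```python
import csv
import io
import re
from datetime import datetime

# Assumed rule for illustration only: start with a letter, then letters,
# digits, or underscores, 10 characters at most. Check your software's limits.
VALID_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,9}")


def invalid_names(names):
    """Return the variable names that break the assumed naming rule."""
    return [name for name in names if not VALID_NAME.fullmatch(name)]


def read_clean_rows(text, date_field="onsetdate", date_format="%Y-%m-%d"):
    """Read CSV text, skip rows with no data, and parse the date field explicitly."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Skip rows in which every cell is empty or whitespace.
        if not any(isinstance(v, str) and v.strip() for v in row.values()):
            continue
        raw = (row.get(date_field) or "").strip()
        # Keep a missing date as None rather than guessing a value.
        row[date_field] = datetime.strptime(raw, date_format).date() if raw else None
        rows.append(row)
    return rows


# Hypothetical line list with one blank row and one missing onset date.
sample = (
    "id,sex,onsetdate,serotype\n"
    "1,F,2011-02-02,Enteritidis\n"
    ",,,\n"
    "2,M,,Typhimurium\n"
)
rows = read_clean_rows(sample)
bad_names = invalid_names(["id", "VAR 1", "1id", "onset-date"])
```

Parsing dates with an explicit format, rather than relying on a program's automatic detection, is one way to keep day and month from being swapped silently during import.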

Improving data quality

  • Before collecting data, pilot data collection tools to clarify and ensure that appropriate data is being collected with the variables used.
  • Train interviewers on every questionnaire. Interviewers must understand all questions.
  • Include a place on the questionnaire to identify the interviewer so that data entry personnel can follow up if required (e.g., unanswered questions, illegible handwriting, nonsensical responses).
  • During data analysis, keep a record in the log book of all recoded and newly created variables.
  • Pay attention to warnings that arise in analyses; if unsure about the meaning of a warning, get clarity to avoid misinterpreting results.
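
As one illustration of recording recoded and newly created variables, the Python sketch below derives an age group from age and writes the decision to an electronic log. The age cut-offs, the log columns, and the function names are hypothetical choices for this example, not toolkit requirements.

```python
import csv
import io
from datetime import date

# Hypothetical log book layout.
LOG_FIELDS = ["date", "variable", "action", "details"]


def log_decision(log, variable, action, details, on=None):
    """Append one data management decision to the in-memory log."""
    log.append({
        "date": (on or date.today()).isoformat(),
        "variable": variable,
        "action": action,
        "details": details,
    })


def age_group(age):
    """Recode age in years into illustrative groups; missing stays missing."""
    if age is None:
        return None
    if age < 5:
        return "0-4"
    if age < 18:
        return "5-17"
    if age < 65:
        return "18-64"
    return "65+"


log = []
ages = [3, 42, None, 70]
groups = [age_group(a) for a in ages]
log_decision(
    log,
    variable="agegrp",
    action="created",
    details="Derived from age: 0-4, 5-17, 18-64, 65+; missing age left missing",
    on=date(2011, 2, 2),
)

# Write the log as CSV text; in practice this would go to a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=LOG_FIELDS)
writer.writeheader()
writer.writerows(log)
log_text = buffer.getvalue()
```

Keeping the log next to the analysis code means the reason for each new variable can be traced later, which a paper log book can equally provide.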

Version control

  • Name files using unique identifiers that include the date (e.g., Salmonella Data 2Feb2011.xxx; Salmonella Linelist 4Feb2012.xxx).
  • Save the files often while creating documents (e.g., line list, questionnaires, epidemic curve, reports), entering data, and executing any other outbreak-related activities.
  • Back up your files regularly to both a secure hard drive (preferably a secure network drive that is itself backed up) and an external memory device (e.g., USB stick/flash drive). Ensure the security of all memory devices (e.g., passwords on computers and external memory devices; external memory devices locked in cabinets and all hardware kept in locked rooms).
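
The naming and backup steps above can be scripted. The Python sketch below builds a dated file name and copies a file to a backup folder, then compares checksums to confirm the copy is intact. The function names and folder layout are assumptions for illustration, and the date is written as yyyy-mm-dd (a variation on the 2Feb2011 style shown above) only because that form sorts chronologically in a file listing.

```python
import hashlib
import shutil
import tempfile
from datetime import date
from pathlib import Path


def dated_name(stem, suffix, on=None):
    """Build a file name such as 'Salmonella_Linelist_2011-02-02.csv'."""
    on = on or date.today()
    return f"{stem}_{on.isoformat()}{suffix}"


def sha256(path):
    """Return the SHA-256 checksum of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source, backup_dir):
    """Copy a file to backup_dir and confirm the copy matches the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256(source) != sha256(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target


# Demonstration in a temporary folder; real paths would point to a secure
# network drive and an external device.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "work"
    work.mkdir()
    line_list = work / dated_name("Salmonella_Linelist", ".csv", date(2011, 2, 2))
    line_list.write_text("id,sex,onsetdate\n1,F,2011-01-30\n")
    copy = back_up(line_list, Path(tmp) / "backup")
    backup_ok = copy.exists() and copy.name == line_list.name
```

A script like this does not replace the physical and password security measures described above; it only reduces the chance of an unnoticed bad copy.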

Data dictionaries

A data dictionary provides a descriptive list of names, definitions, and attributes of the data elements in a dataset or database (e.g., line lists, database with questionnaire responses). For each data element, information such as the descriptive name, data type, allowed values, units, and text description is documented. Developing and using a consistent set of data elements and formats to document database content in this way helps ensure data consistency, standardization, accuracy, and reliability, which in turn facilitates comparability, reporting, and outbreak investigation.

The table below provides a description of commonly used attributes in a data dictionary. These are provided as a guide only, as each database is unique and there is therefore variability among data dictionaries.

Table: Description of common data dictionary characteristics

Characteristic | Description
Name | Commonly agreed, unique data element name.
Field name | Name used for data element in computer programs and database schemas.
Definition | Description of meaning of data element.
Unit of measure | Scientific or other unit of measure that applies to data value.
Value | The reported value.
Data type | Data type (text, numeric, date, yes/no, etc.).
Size | Maximum field length as measured in characters and/or number of decimal places.
Field constraints | Required/conditional/null.
Coding / Values | Explanation of coding for acceptable values and validation rules.
Data source | Short description of source of data. Includes rules used in calculations to produce data element value.
Related data elements | Names of closely related data elements when relationship is important.
Input mask | Required data layout (e.g., yyyy-mm-dd).
History references | Date when data element was defined in present form, previous definitions, etc.
Comments | To provide additional details related to a data element.
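
A data dictionary can also be held in machine-readable form and used to check records as they are entered. The Python sketch below is one possible arrangement, using a few of the characteristics listed above (data type, field constraints, coding/values, unit of measure, input mask). The fields, codes, and the validate function are hypothetical and are not taken from the toolkit's own data dictionary.

```python
from datetime import datetime

# Hypothetical data dictionary for a small line list.
DATA_DICTIONARY = {
    "id": {"type": "numeric", "required": True},
    "sex": {"type": "text", "required": True, "values": {"M", "F", "U"}},
    "onsetdate": {"type": "date", "required": False, "mask": "%Y-%m-%d"},
    "age": {"type": "numeric", "required": False, "unit": "years"},
}


def validate(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found when checking one record."""
    problems = []
    for field, rule in dictionary.items():
        raw = record.get(field)
        value = "" if raw is None else str(raw).strip()
        if not value:
            if rule["required"]:
                problems.append(f"{field}: required value is missing")
            continue
        if rule["type"] == "numeric":
            try:
                float(value)
            except ValueError:
                problems.append(f"{field}: '{value}' is not numeric")
        elif rule["type"] == "date":
            try:
                datetime.strptime(value, rule["mask"])
            except ValueError:
                problems.append(f"{field}: '{value}' does not match the input mask")
        if "values" in rule and value not in rule["values"]:
            problems.append(f"{field}: '{value}' is not an allowed code")
    # Flag fields that the dictionary does not describe.
    for field in record:
        if field not in dictionary:
            problems.append(f"{field}: not defined in the data dictionary")
    return problems


good = validate({"id": "1", "sex": "F", "onsetdate": "2011-02-02", "age": "34"})
bad = validate({"id": "", "sex": "X", "onsetdate": "02/02/2011"})
```

Here `good` is an empty list, while `bad` reports the missing id, the unrecognized sex code, and the date that does not follow the yyyy-mm-dd mask.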


Tools

Toolkit line list and data dictionary

  • This Microsoft Excel-based tool is designed to be used as a template for foodborne outbreak investigation line lists. Once data has been entered, common descriptive statistics are automatically calculated. A data dictionary describing each data field in the line list is available in the final tab.

Toolkit outbreak response database

  • This Microsoft Access Database will follow the layout and structure of the PHAC enteric hypothesis generating questionnaires. Users will be able to enter data, export select fields to a Microsoft Excel line list, generate automatic food and risk exposure summary tables, and run custom queries. This database tool is expected to be complete by the end of 2015.

EpiData Software

  • EpiData is free, open-source software created for epidemiologists, with two components: EpiData Entry and EpiData Analysis. EpiData Entry is primarily used for simple data entry and data documentation. EpiData Analysis supports basic statistical analysis, graphing, and comprehensive data management.

Epi Info

  • Epi Info™ is a public domain suite of software tools designed for public health practitioners and researchers. It supports the construction of data entry forms and databases, and data analysis with epidemiologic statistics, maps, and graphs, for public health professionals who may lack an information technology background.


References

Gregg, M.B. (ed.). 2002. Field Epidemiology, 2nd Edition. Oxford University Press, Oxford, England.
