Data management

Overview

Data management is an ongoing task in an outbreak investigation. Data management issues include the collection, storage, and security of information. Attention to the details of data collection, including tool creation and data entry itself, facilitates the creation of valid and reliable results upon which to base conclusions and recommendations. Having a clear data management plan will facilitate communication and coordination throughout the investigation and is a key part of both descriptive studies and analytic studies.

Software options

There is a wide range of software options for data management. The following table summarizes the most common software programs and how they perform for three key functions:

Data entry
Basic data analysis
Advanced data analysis

Platform	Data entry	Basic data analysis	Advanced data analysis
Microsoft Excel	Yes	Yes	No
Microsoft Access	Yes (multiple users)	Limited	No
EpiData	Yes	Yes	Yes (limited)
SAS	n/a	Yes	Yes
Stata	n/a	Yes	Yes
SPSS	n/a	Yes	Yes
Online survey platforms (e.g., Voxco)	Yes (multiple users)	Yes	No

Additional points to consider:

Cost
Graphics capabilities
Speed on network (if applicable)
Ability to modify
Current skill level of employees
User support

General guidelines and considerations

Planning

Design data collection tools (e.g., questionnaires) with both a data management plan and a data analysis plan in mind.
Keep a paper or electronic log book of all data management decisions made during the investigation.
Create a communication and information dissemination plan ahead of time to facilitate information sharing and flow under outbreak conditions.

Design considerations

The format of the data collection tool/questionnaire (paper vs. electronic entry form) and the questions themselves (open-ended text vs. forced choice) affect, and may limit, the types of analyses that can be performed and the conclusions that can be drawn.
To maintain data integrity, be aware of how blank rows and columns, missing values, column headings or variable names, and particular field types (e.g., date fields) are handled when moving data between software products.
Ensure that each variable measures, quantifies, or qualifies only one specific thing (as opposed to multiple).
Clarify how many and which characters are allowed for variable names and all fields in the software product(s) you are using.
Name variables in data collection tools with short, descriptive names that make sense to you and are likely to make sense to others (e.g., id, sex, onsetdate, and serotype rather than VAR1, VAR2, VAR3, and VAR4, respectively).
Allow space for a unique questionnaire ID number or pre-assign ID numbers.
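
Some of these design checks can be automated before data collection begins. The Python sketch below is illustrative only. The naming rule (a lowercase letter first, then lowercase letters, digits, or underscores, 12 characters at most) is an assumption made for the example and not a limit of any particular software product, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed rule for illustration only: a lowercase letter, followed by up to
# 11 lowercase letters, digits, or underscores. Check your software's limits.
NAME_RULE = re.compile(r"[a-z][a-z0-9_]{0,11}")


def invalid_names(names):
    """Return the variable names that break the assumed naming rule."""
    return [name for name in names if not NAME_RULE.fullmatch(name)]


def duplicate_ids(ids):
    """Return questionnaire IDs that appear more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, count in counts.items() if count > 1)


names = ["id", "sex", "onsetdate", "serotype", "Onset Date", "1st_symptom"]
print(invalid_names(names))                          # ['Onset Date', '1st_symptom']
print(duplicate_ids(["001", "002", "002", "003"]))   # ['002']
```

Running checks like these on the draft tool and on early data entry can flag names that another software product might reject and ID numbers that were assigned twice.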

Improving data quality

Before collecting data, pilot the data collection tools to confirm that the questions are clear and that the variables used capture the appropriate data.
Train interviewers on every questionnaire. Interviewers must understand all questions.
Include a place on the questionnaire to identify the interviewer so that data entry personnel can follow up if required (e.g., unanswered questions, illegible handwriting, nonsensical responses).
During data analysis, keep a record in the log book of all recoded and newly created variables.
Pay attention to warnings that arise in analyses; if unsure about the meaning of a warning, seek clarification to avoid misinterpreting results.
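
As one possible way to record recoded and newly created variables, the Python sketch below derives an age group variable and adds a matching log book entry in the same step. The age cut-offs, the variable name agegroup, and the function names are hypothetical, and the in-memory list stands in for whatever paper or electronic log book the investigation actually uses.

```python
from datetime import date

# Stand-in for the investigation's log book (assumption: an in-memory list).
log_book = []


def log_decision(variable, action, note, when=None):
    """Append a dated entry describing a data management decision."""
    entry = {
        "date": (when or date.today()).isoformat(),
        "variable": variable,
        "action": action,
        "note": note,
    }
    log_book.append(entry)
    return entry


def recode_age_group(age):
    """Hypothetical recode: age in years to an age group; missing stays missing."""
    if age is None:
        return None
    if age < 5:
        return "0-4"
    if age < 20:
        return "5-19"
    if age < 60:
        return "20-59"
    return "60+"


# Made-up records for illustration.
cases = [{"id": "001", "age": 3}, {"id": "002", "age": 45}, {"id": "003", "age": None}]
for case in cases:
    case["agegroup"] = recode_age_group(case["age"])

log_decision(
    "agegroup",
    "created",
    "Derived from age: 0-4, 5-19, 20-59, 60+; missing age left missing",
)
```

Keeping the recode and its log entry together in one script makes it harder for the two to drift apart as the analysis evolves.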

Version control

Name files using unique identifiers that include the date (e.g., Salmonella Data 2Feb2011.xxx; Salmonella Linelist 4Feb2012.xxx).
Save the files often while creating documents (e.g., line list, questionnaires, epidemic curve, reports), entering data, and executing any other outbreak-related activities.
Back up your files regularly to both a secure hard drive (preferably a secure network drive that is itself backed up) and an external memory device (e.g., USB stick/flash drive). Ensure the security of all memory devices (e.g., passwords on computers and external memory devices; external memory devices locked in cabinets and all hardware kept in locked rooms).
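
The naming and backup guidance above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names dated_filename and back_up are hypothetical, the file extension and backup folder are placeholders, and a real backup location would be the secure drive or device chosen for the investigation.

```python
import shutil
from datetime import date
from pathlib import Path

# Month abbreviations are spelled out so the result does not depend on locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def dated_filename(label, when, extension):
    """Build a name such as 'Salmonella Linelist 2Feb2011.csv'."""
    return f"{label} {when.day}{MONTHS[when.month - 1]}{when.year}.{extension}"


def back_up(source, backup_dir):
    """Copy a file into backup_dir (created if needed) and return the copy's path."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, backup_dir / Path(source).name))


print(dated_filename("Salmonella Linelist", date(2011, 2, 2), "csv"))
# Salmonella Linelist 2Feb2011.csv

# Example call (placeholder paths, not run here):
# back_up("Salmonella Linelist 2Feb2011.csv", "backup_folder")
```

A date written as year-month-day (e.g., 2011-02-02) would also make file names sort in chronological order; the sketch keeps the day-month-year style used in the examples above.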

Data dictionaries

A data dictionary provides a descriptive list of names, definitions, and attributes of the data elements in a dataset or database (e.g., line lists, database with questionnaire responses). For each data element, information such as the descriptive name, data type, allowed values, units, and text description is documented. Developing and using a consistent set of data elements and formats to document the database content in this manner helps to ensure data consistency, standardization, accuracy, and reliability, and thereby facilitates comparability, reporting, and outbreak investigation.

The table below provides a description of commonly used attributes in a data dictionary. These are provided as a guide only, as each database is unique and there is therefore variability among data dictionaries.

Table: Description of common data dictionary characteristics

Characteristic	Description
Name	Commonly agreed, unique data element name.
Field name	Name used for data element in computer programs and database schemas.
Definition	Description of meaning of data element.
Unit of measure	Scientific or other unit of measure that applies to data value.
Value	The reported value.
Data type	Data type (text, numeric, date, yes/no, etc.).
Size	Maximum field length as measured in characters and/or number of decimal places.
Field constraints	Required/conditional/null.
Coding / Values	Explanation of coding for acceptable values and validation rules.
Data source	Short description of source of data. Includes rules used in calculations to produce data element value.
Related data elements	Names of closely related data elements when relationship is important.
Input mask	Required data layout (e.g., yyyy-mm-dd).
History references	Date when data element was defined in present form, previous definitions, etc.
Comments	To provide additional details related to a data element.
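
To show how a data dictionary can support data quality checks, the Python sketch below stores a few of the characteristics from the table (data type, size, field constraints, coding/values, input mask) for four example fields and checks a record against them. The fields, codes, and function names are hypothetical, and the checks are a small subset of what a real data dictionary might specify.

```python
from datetime import datetime

# Hypothetical data dictionary: field name -> selected characteristics.
DATA_DICTIONARY = {
    "id": {"type": "text", "required": True, "size": 10},
    "sex": {"type": "text", "required": True, "values": {"M", "F", "U"}},
    "onsetdate": {"type": "date", "required": False, "mask": "%Y-%m-%d"},
    "age": {"type": "numeric", "required": False},
}


def matches_date_layout(value, layout):
    """True only if value parses and is written exactly in the layout."""
    try:
        parsed = datetime.strptime(value, layout)
    except (TypeError, ValueError):
        return False
    return parsed.strftime(layout) == value


def is_numeric(value):
    """True if value can be read as a number."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"{field}: not in data dictionary"
                for field in record if field not in dictionary]
    for field, spec in dictionary.items():
        value = record.get(field)
        if value is None or value == "":
            if spec["required"]:
                problems.append(f"{field}: required value is missing")
            continue
        if spec["type"] == "date" and not matches_date_layout(value, spec["mask"]):
            problems.append(f"{field}: expected layout yyyy-mm-dd")
        if spec["type"] == "numeric" and not is_numeric(value):
            problems.append(f"{field}: expected a number")
        if "values" in spec and value not in spec["values"]:
            problems.append(f"{field}: value not in coding list")
        if "size" in spec and len(str(value)) > spec["size"]:
            problems.append(f"{field}: longer than {spec['size']} characters")
    return problems


print(validate_record({"id": "002", "sex": "X", "onsetdate": "02/02/2011"}))
# ['sex: value not in coding list', 'onsetdate: expected layout yyyy-mm-dd']
```

Keeping the rules in one structure means the same definitions can drive both the written data dictionary and the checks applied at data entry.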

Tools

Toolkit line list and data dictionary

This Microsoft Excel-based tool is designed to be used as a template for foodborne outbreak investigation line lists. Once data has been entered, common descriptive statistics are automatically calculated. A data dictionary describing each data field in the line list is available in the final tab.
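
For readers working outside Excel, the Python sketch below illustrates the kind of descriptive statistics that can be calculated from a line list. It does not reproduce the calculations in the toolkit template; the records are made up, and the field names follow the hypothetical examples used earlier (id, sex, age, onsetdate).

```python
from collections import Counter
from statistics import median

# Made-up line list for illustration; dates are text in yyyy-mm-dd layout.
line_list = [
    {"id": "001", "sex": "F", "age": 3, "onsetdate": "2011-01-28"},
    {"id": "002", "sex": "M", "age": 45, "onsetdate": "2011-02-02"},
    {"id": "003", "sex": "F", "age": 27, "onsetdate": "2011-01-30"},
    {"id": "004", "sex": "M", "age": None, "onsetdate": None},
]


def describe(cases):
    """Summarize case count, sex distribution, median age, and onset range."""
    ages = [c["age"] for c in cases if c["age"] is not None]
    # Dates in yyyy-mm-dd layout sort chronologically as plain text.
    onsets = sorted(c["onsetdate"] for c in cases if c["onsetdate"])
    return {
        "cases": len(cases),
        "by_sex": dict(Counter(c["sex"] for c in cases)),
        "median_age": median(ages) if ages else None,
        "first_onset": onsets[0] if onsets else None,
        "last_onset": onsets[-1] if onsets else None,
    }


summary = describe(line_list)
print(summary)
```

Missing ages and onset dates are excluded from the median and the date range but still counted as cases, which is one possible convention and should be recorded in the log book if used.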

EpiData Software

EpiData is free, open-source software created for epidemiologists. It has two components, EpiData Entry and EpiData Analysis. EpiData Entry is primarily used for simple data entry and data documentation. EpiData Analysis provides basic statistical analysis, graphs, and comprehensive data management.

References

Gregg, M.B. (ed.). 2002. Field Epidemiology, 2nd Edition. Oxford University Press, Oxford, England.