Overview
Data management is an ongoing task in an outbreak investigation. Data management issues include the collection, storage, and security of information. Attention to the details of data collection, including tool creation and data entry itself, facilitates the creation of valid and reliable results upon which to base conclusions and recommendations. Having a clear data management plan will facilitate communication and coordination throughout the investigation and is a key part of both descriptive studies and analytic studies.
Software options
A wide range of software options is available for data management. The following table summarizes the most common software programs and how they perform for three key functions:
- Data entry
- Basic data analysis
- Advanced data analysis
Platform | Data entry | Basic data analysis | Advanced data analysis |
---|---|---|---|
Microsoft Excel | Problematic | Yes | No |
Microsoft Access | Yes (multiple users) | Limited | No |
EpiData | Yes | Yes | Yes (limited) |
Epi Info 7 | Yes | Yes (limited) | Yes (limited) |
SAS | n/a | Yes | Yes |
Stata | n/a | Yes | Yes |
SPSS | n/a | Yes | Yes |
Online Survey Platforms (e.g., Voxco) | Yes (multiple users) | Yes | No |
Additional points to consider:
- Cost
- Graphics capabilities
- Speed on network (if applicable)
- Ability to modify
- Current skill level of employees
- User support
General guidelines and considerations
Planning
- Design data collection tools (e.g., questionnaires) with both a data management plan and a data analysis plan in mind.
- Keep a paper or electronic log book of all data management decisions made during the investigation.
- Create a communication and information dissemination plan ahead of time to facilitate information sharing and flow under outbreak conditions.
Design considerations
- The format of the data collection tool/questionnaire (paper vs. electronic entry form) and the questions themselves (open-ended text vs. forced choice) affect, and may limit, the types of analyses that can be performed and the conclusions that can be drawn.
- To maintain data integrity when moving data between software products, be aware of how each product handles blank rows and columns, missing values, column headings or variable names, and particular field types (e.g., date fields).
- Ensure that each variable measures, quantifies, or qualifies only one specific thing (as opposed to multiple).
- Clarify how many and which characters are allowed for variable names and all fields in the software product(s) you are using.
- Name variables in data collection tools with short and descriptive names that make sense to you and are more likely to make sense to others (e.g., id, sex, onsetdate, and serotype and not VAR1, VAR2, VAR3 and VAR4, respectively).
- Allow space for a unique questionnaire ID number or pre-assign ID numbers.
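The naming guidance above can be checked automatically before a data collection tool is finalized. The following is a minimal sketch only: the length limit, the lowercase-only character rule, and the function name `check_variable_names` are assumptions made for illustration, and the real limits depend on the software product(s) in use.

```python
import re

# Hypothetical limits: confirm the real ones for the software you are using.
MAX_NAME_LENGTH = 10
ALLOWED_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")      # lowercase letters, digits, underscore
PLACEHOLDER_PATTERN = re.compile(r"^(var|v|q|field|column)\d+$")  # e.g., VAR1, Q12


def check_variable_names(names):
    """Return a dict mapping each problematic name to a list of problems found."""
    problems = {}
    seen = set()
    for name in names:
        issues = []
        if len(name) > MAX_NAME_LENGTH:
            issues.append("too long")
        if not ALLOWED_PATTERN.match(name):
            issues.append("disallowed characters")
        if PLACEHOLDER_PATTERN.match(name.lower()):
            issues.append("not descriptive")
        if name.lower() in seen:
            issues.append("duplicate")
        seen.add(name.lower())
        if issues:
            problems[name] = issues
    return problems


# Descriptive names pass; placeholder names and names with spaces are flagged.
print(check_variable_names(["id", "sex", "onsetdate", "serotype"]))
print(check_variable_names(["VAR1", "onset date"]))
```

Running a check like this on the variable list before data entry begins is cheaper than renaming variables after data have been exported to another product.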
Improving data quality
- Before collecting data, pilot the data collection tools to confirm that the variables used capture the appropriate data.
- Train interviewers on every questionnaire. Interviewers must understand all questions.
- Include a place on the questionnaire to identify the interviewer so that data entry personnel can follow up if required (e.g., unanswered questions, illegible handwriting, nonsensical responses).
- During data analysis, keep a record in the log book of all recoded and newly created variables.
- Pay attention to warnings that arise in analyses; if unsure about the meaning of a warning, get clarity to avoid misinterpreting results.
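As an illustration of recording recoded and newly created variables, the sketch below derives a new variable and writes a matching log book entry in the same step. The age groups, the variable name `agegroup`, and the helper `log_change` are hypothetical and would be replaced by whatever the investigation actually uses.

```python
from datetime import date


def age_group(age):
    """Recode age in years into a hypothetical three-level age group."""
    if age is None:
        return None  # keep missing values missing rather than guessing
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"


def log_change(log, variable, change, entry_date=None):
    """Append one dated entry to the log book (a plain list of dicts)."""
    entry = {
        "date": (entry_date or date.today()).isoformat(),
        "variable": variable,
        "change": change,
    }
    log.append(entry)
    return entry


# Illustrative records only; not real outbreak data.
records = [{"id": 1, "age": 4}, {"id": 2, "age": 37}, {"id": 3, "age": None}]
log_book = []

for record in records:
    record["agegroup"] = age_group(record["age"])
log_change(log_book, "agegroup",
           "new variable derived from age: 0-17, 18-64, 65+",
           date(2011, 2, 2))

print(log_book)
```

Keeping the recode and its log entry side by side in the same script makes it harder for one to be updated without the other.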
Version control
- Name files using unique identifiers that include the date (e.g., Salmonella Data 2Feb2011.xxx; Salmonella Linelist 4Feb2012.xxx).
- Save the files often while creating documents (e.g., line list, questionnaires, epidemic curve, reports), entering data, and executing any other outbreak-related activities.
- Back up your files regularly to both a secure hard drive (preferably a secure network drive that is itself backed up) and an external memory device (e.g., USB stick/flash drive). Ensure the security of all memory devices (e.g., passwords on computers and external memory devices; external memory devices locked in cabinets and all hardware kept in locked rooms).
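The file naming and backup practices above could be scripted along the following lines. This is a sketch under stated assumptions: the helper names `dated_filename` and `back_up`, the `xlsx` extension, and the backup folder names are illustrative, and securing the backup locations (passwords, locked storage) still has to be handled separately.

```python
import shutil
from datetime import date
from pathlib import Path

# Month abbreviations are spelled out so the result does not depend on locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def dated_filename(label, extension, on=None):
    """Build a name such as 'Salmonella Data 2Feb2011.xlsx'."""
    on = on or date.today()
    stamp = f"{on.day}{MONTHS[on.month - 1]}{on.year}"
    return f"{label} {stamp}.{extension}"


def back_up(source, backup_dirs):
    """Copy one file into each backup folder and return the copy paths."""
    source = Path(source)
    copies = []
    for folder in backup_dirs:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        # copy2 also preserves file timestamps where the platform allows it
        copies.append(Path(shutil.copy2(source, folder / source.name)))
    return copies


print(dated_filename("Salmonella Data", "xlsx", date(2011, 2, 2)))
# prints: Salmonella Data 2Feb2011.xlsx
```

In practice, `back_up` would be called with one folder on the secure network drive and one on the external memory device.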
Data dictionaries
A data dictionary provides a descriptive list of names, definitions, and attributes of the data elements in a dataset or database (e.g., line lists, database with questionnaire responses). For each data element, information such as the descriptive name, data type, allowed values, units, and text description is documented. Developing and using a consistent set of data elements and formats to document database content in this manner helps to ensure data consistency, standardization, accuracy, and reliability, which in turn facilitates comparability, reporting, and outbreak investigation.
The table below provides a description of commonly used attributes in a data dictionary. These are provided as a guide only, as each database is unique and there is therefore variability among data dictionaries.
Table: Description of common data dictionary characteristics
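A data dictionary can also be kept in machine-readable form and used to check records as they are entered. The sketch below is illustrative only: the four variables, their allowed values, the date format, and the function `validate_record` are assumptions and do not reproduce the toolkit's own data dictionary.

```python
from datetime import datetime

# Hypothetical data dictionary for a four-variable line list.
DATA_DICTIONARY = {
    "id": {"description": "Unique questionnaire ID",
           "type": "integer", "required": True},
    "sex": {"description": "Sex of the case",
            "type": "text", "allowed": {"M", "F", "U"}, "required": True},
    "onsetdate": {"description": "Date of symptom onset",
                  "type": "date", "format": "%Y-%m-%d", "required": False},
    "serotype": {"description": "Serotype reported by the laboratory",
                 "type": "text", "required": False},
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found when checking one record."""
    errors = []
    for name in record:
        if name not in dictionary:
            errors.append(f"{name}: not in data dictionary")
    for name, spec in dictionary.items():
        value = record.get(name)
        text = "" if value is None else str(value).strip()
        if text == "":
            if spec.get("required"):
                errors.append(f"{name}: missing required value")
            continue
        if spec["type"] == "integer":
            try:
                int(text)
            except ValueError:
                errors.append(f"{name}: expected an integer")
        elif spec["type"] == "date":
            try:
                datetime.strptime(text, spec["format"])
            except ValueError:
                errors.append(f"{name}: expected date in format {spec['format']}")
        if "allowed" in spec and text not in spec["allowed"]:
            errors.append(f"{name}: value {text!r} not in allowed values")
    return errors


good = {"id": "12", "sex": "F", "onsetdate": "2011-02-02", "serotype": "Enteritidis"}
bad = {"id": "", "sex": "X", "onsetdate": "02/02/2011"}
print(validate_record(good))
print(validate_record(bad))
```

Checking the date format explicitly is one way to catch date fields that were altered when data moved between software products.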
Tools
Toolkit line list and data dictionary
- This Microsoft Excel-based tool is designed to be used as a template for foodborne outbreak investigation line lists. Once data has been entered, common descriptive statistics are automatically calculated. A data dictionary describing each data field in the line list is available in the final tab.
Toolkit outbreak response database
- This Microsoft Access Database will follow the layout and structure of the PHAC enteric hypothesis generating questionnaires. Users will be able to enter data, export select fields to a Microsoft Excel line list, generate automatic food and risk exposure summary tables, and run custom queries. Due to the Government of Canada’s Standard on Web Accessibility, this tool cannot be posted, but it is available upon request. Please contact us at info@outbreaktools.ca to request a copy. Please let us know if you need support or an accessible format.
- EpiData is free, open source software created for epidemiologists. It has two components, EpiData Entry and EpiData Analysis. EpiData Entry is primarily used for simple data entry and data documentation. EpiData Analysis performs basic statistical analysis, graphing, and comprehensive data management.
- Epi Info™ is a public domain suite of software tools designed for public health practitioners and researchers. It supports the construction of data entry forms and databases, and data analyses with epidemiologic statistics, maps, and graphs, for public health professionals who may lack an information technology background.
- Epi Info 7 User Guides
- Tutorials
- Epi Info Community Group: This discussion board allows community members to post and reply to questions related to the Epi Info software and share training materials.
References
Gregg, M.B. (ed.). 2002. Field Epidemiology, 2nd Edition. Oxford University Press, Oxford, England.