All research collects data of some kind. To make sense of the data, you have to analyze it. The analysis begins with labeling the data as to its source, how it was collected, the information it contains, and so on. However, working with raw alphanumeric data can be very cumbersome, whether it is hundreds of mailed questionnaires, figures on the annual accident rates for all fifty states, or observations of schoolchildren’s behavior in the classroom. For this reason, data is often coded.
Coding allows the researcher to reduce large amounts of information to a form that can be handled more easily, especially by computer programs. Not all data needs to be coded. For example, the accident rates for the fifty states would not be coded, but each state could be assigned a number (1 through 50) to use in place of the state name. There are also content analysis software programs that help researchers code textual data for qualitative or quantitative analysis.
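As a minimal sketch of this kind of coding, assuming Python and a hypothetical handful of states standing in for all fifty, state names can be replaced by numbers assigned in alphabetical order. The names `state_codes` and `code_state` are illustrative only.

```python
# Hypothetical subset of states; a real study would list all fifty.
states = ["Texas", "Alabama", "Ohio", "California"]

# Assign codes 1..N in alphabetical order so the mapping is reproducible.
state_codes = {name: code for code, name in enumerate(sorted(states), start=1)}


def code_state(name):
    """Return the numeric code for a state name; raises KeyError if unknown."""
    return state_codes[name]


for name in states:
    print(name, code_state(name))
```

The accident rate itself stays as it is; only the state name is replaced by its code.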
Alphanumeric Data Management Steps
Prepare the data collection instrument and collect the data
Prepare the data dictionary or codebook
Prepare data matrix worksheets
Prepare instructions for data entry and data analysis
Example: Quality of Work Life Questionnaire
- Name of the Division where you work: _____________________________
- How long have you been an employee at this company? _______years
- How many county-sponsored training sessions have you attended? _____
- What is your job classification? _____ Management _____ Technical _____ Administrative
- Is your position _____ supervisory _____ non-supervisory
- Sex _____Female _____Male
- In which area would you like to receive additional training? ___________
Prepare the data dictionary or codebook
If the data is to be entered into a computer program, be it a spreadsheet, database, or statistical program, it must be entered in exactly the same way for each person, questionnaire, state, or other unit of analysis.
Many computer programs limit how data can be entered, stored, and retrieved. These limits should be reflected in the codebook. For example, variable names often cannot exceed eight characters. Use short variable names, preferably made up entirely of letters. In general, both numbers and letters can be used in variable names, but spaces, punctuation marks, and other special characters cannot.
The variable names you assign to the data should reflect the nominal definitions of the variables themselves, such as “age”, “jobclass”, “seniority”, etc., written without spaces and abbreviated where necessary to fit the program’s length limit. You can adopt a rule such as using only lowercase letters, or only uppercase letters, for any alphanumeric data you enter. This will make it easier to type the variable names later, when you need to tell the computer program which variables to analyze.
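The naming rules above can be checked mechanically before data entry begins. The following is a sketch only: it assumes an eight-character limit and a lowercase-only convention, and it adds the assumption that a name must start with a letter. The function name `is_valid_name` is hypothetical; the actual rules depend on the program you use.

```python
import re

MAX_LEN = 8  # assumed limit; check the documentation of your own program


def is_valid_name(name, max_len=MAX_LEN):
    """True if the name fits the length limit, starts with a lowercase letter,
    and contains only lowercase letters and digits."""
    pattern_ok = re.fullmatch(r"[a-z][a-z0-9]*", name) is not None
    return len(name) <= max_len and pattern_ok


for candidate in ["age", "jobclass", "job class", "seniority", "Age"]:
    verdict = "ok" if is_valid_name(candidate) else "rejected"
    print(f"{candidate!r}: {verdict}")
```

Under these assumed rules, “job class” is rejected because of the space and “seniority” because it has nine characters, so it would need to be abbreviated.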
Data can also be stored as text, in what is called an alphanumeric format. This allows the variable to hold letters, numbers, or a combination of both. For example, it can hold first names such as “Amy”, “Brad”, or “Caroline”, or combinations such as apartment numbers (102b) or license plate numbers (3XGJ429).
The codebook tells the coder how each questionnaire will be coded for data entry. It specifies the questionnaire item from which the data is taken, the name of the variable, the operational definition of the variable, the coding options, the type of variable (numeric or alphanumeric), and the number of columns the variable requires.
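One way to picture a codebook is as a list of entries, one per variable. The sketch below does this in Python for the Quality of Work Life questionnaire. The variable names, the numeric code values, and the use of 9 and 99 for missing data are assumptions made for illustration; the source questionnaire does not fix them.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    item: str        # questionnaire item the data is taken from
    name: str        # variable name (short, letters only)
    definition: str  # operational definition of the variable
    vtype: str       # "numeric" or "alphanumeric"
    width: int       # number of columns the variable requires
    codes: dict = field(default_factory=dict)  # coding options, if any


# Names and code values are illustrative assumptions.
codebook = [
    CodebookEntry("(assigned)", "id", "Employee identification number", "numeric", 3),
    CodebookEntry("Q1", "division", "Division where the employee works", "numeric", 1),
    CodebookEntry("Q2", "yrsemp", "Years employed at the company", "numeric", 2,
                  {99: "missing"}),
    CodebookEntry("Q3", "training", "Training sessions attended", "numeric", 2,
                  {99: "missing"}),
    CodebookEntry("Q4", "jobclass", "Job classification", "numeric", 1,
                  {1: "Management", 2: "Technical", 3: "Administrative", 9: "missing"}),
    CodebookEntry("Q5", "superv", "Supervisory status", "numeric", 1,
                  {1: "supervisory", 2: "non-supervisory", 9: "missing"}),
    CodebookEntry("Q6", "sex", "Sex of respondent", "numeric", 1,
                  {1: "Female", 2: "Male", 9: "missing"}),
    CodebookEntry("Q7", "trnwant", "Area of additional training desired", "numeric", 1),
]

for entry in codebook:
    print(f"{entry.item:<11}{entry.name:<10}{entry.vtype:<9}{entry.width}")

# Total number of columns one record will occupy.
total_width = sum(entry.width for entry in codebook)
print("columns per record:", total_width)
```

Keeping the column widths in the codebook makes it easy to confirm that every record will occupy the same number of columns.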
String variables have values that are treated as text. This means that the values of string variables can include numbers, letters, or symbols. In the Data View window, missing string values will appear as blank cells. Note, however, that these blank cells are not recognized by SPSS as system-missing values (that is, SPSS considers even blank strings non-missing). This has important implications if you plan to use a string variable in an analysis, as it will affect the sample size.
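The effect on sample size can be illustrated outside SPSS. The sketch below is plain Python, not SPSS syntax, and the response values are hypothetical. It compares a count that treats every string as valid with a count taken after blank strings have been recoded as missing.

```python
# Hypothetical string responses; two of them are blank.
responses = ["Finance", "", "Parks", "   ", "Finance"]

# Naive count: every string, even a blank one, is treated as a valid value.
naive_n = sum(1 for r in responses if r is not None)

# Recode blank or whitespace-only strings to None (missing), then count again.
cleaned = [r if r.strip() else None for r in responses]
valid_n = sum(1 for r in cleaned if r is not None)

print("cases counted without recoding:", naive_n)
print("cases counted after recoding blanks:", valid_n)
```

The two counts differ by the number of blank cells, which is the discrepancy to watch for when a string variable is used in an analysis.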
The researcher has to try to anticipate what the data will be like. A good idea of this can be obtained by doing a pilot test of the instrument and a simulation of the data collection process. It is important to ensure that you leave enough columns to properly code the information for each variable and provide enough variables to capture the full richness, complexity, and variety of data that has been collected.
If a sample of college students is asked about the obstacles they encounter when trying to use the campus library, will they be asked to list the main obstacle, to rank all the obstacles, or to choose only the ones that affect them? And what happens if the students do not follow the instructions? Depending on the way the data is presented, the researcher will have to decide how to code this information, using one, two or many variables.
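The choice between one variable and many can be sketched as follows. The obstacle list, the variable names (`chk1`, `rank1`, and so on), and the use of 9 for an unranked obstacle are all hypothetical. A “choose all that apply” item becomes one 0/1 variable per obstacle, and a ranking item becomes one rank variable per obstacle.

```python
# Hypothetical list of obstacles printed on the questionnaire, in order.
OBSTACLES = ["hours", "noise", "distance", "catalog"]


def code_checklist(checked):
    """'Choose all that apply': one variable per obstacle, 1 = checked, 0 = not."""
    return {f"chk{i}": int(o in checked) for i, o in enumerate(OBSTACLES, start=1)}


def code_ranking(ranked):
    """Ranking item: one variable per obstacle holding its rank (1 = main obstacle).
    An obstacle the student left unranked is coded 9 (assumed missing code)."""
    return {
        f"rank{i}": (ranked.index(o) + 1) if o in ranked else 9
        for i, o in enumerate(OBSTACLES, start=1)
    }


print(code_checklist({"noise", "hours"}))
print(code_ranking(["distance", "hours"]))
```

A student who ranks only two obstacles when asked to rank all four still produces a complete record, because the unranked obstacles receive the missing code.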
Prepare data matrix worksheets
When data is to be entered into a computer program for statistical analysis, it is usually done in matrix form. Variable names are entered at the top of the columns that will contain the data for that variable, and case records are entered in the rows. Continuing with the previous example we would have:
Data entry worksheets
Quality of Work Life codebook
Columns 1 through 3 together represent the employee identification number.
Column 4 represents the division in which the employee works.
Columns 5-6 represent the length of employment.
Columns 7-8 represent the number of training sessions attended.
Column 9 represents the person’s job classification.
Column 10 indicates whether the person is a supervisor or not.
Column 11 indicates whether the person is male or female.
Column 12 indicates the type of training that the person wishes to receive in the future.
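The column layout above can be turned into a small reader. This is a sketch under assumptions: the variable names are hypothetical, every field is taken to hold a numeric code, and the sample record is invented for illustration.

```python
# (variable name, first column, last column), 1-based and inclusive,
# following the column layout listed above. Names are hypothetical.
LAYOUT = [
    ("id", 1, 3),
    ("division", 4, 4),
    ("yrsemp", 5, 6),
    ("training", 7, 8),
    ("jobclass", 9, 9),
    ("superv", 10, 10),
    ("sex", 11, 11),
    ("trnwant", 12, 12),
]
RECORD_WIDTH = 12


def parse_record(line):
    """Split one fixed-field record into a dictionary of integer codes."""
    if len(line) != RECORD_WIDTH:
        raise ValueError(f"expected {RECORD_WIDTH} columns, got {len(line)}")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return {name: int(line[start - 1:end]) for name, start, end in LAYOUT}


record = parse_record("001207032123")  # invented sample record
print(record)
```

Because each variable is located only by its column position, a record that is one character too short or too long is rejected rather than silently misread.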
Each record must be entered in exactly the same way. When each variable must occupy the same fixed columns in every record, the layout is called a fixed-field format. If a record has missing data for any variable, something must still be entered in that field. Typically this is a number that indicates missing data: for a one-column field, use 9; for a two-column field, use 99; and so on. The code chosen must be a value that cannot occur as a real answer. Some software allows a dot (“.”) as a placeholder that also indicates missing data.
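Writing a record in fixed-field format with missing data codes might look like the sketch below. The variable names and the sample values are hypothetical; the field widths follow the column layout of the example, and the all-nines rule follows the convention just described.

```python
# Field widths in column order; variable names are hypothetical.
WIDTHS = {
    "id": 3, "division": 1, "yrsemp": 2, "training": 2,
    "jobclass": 1, "superv": 1, "sex": 1, "trnwant": 1,
}


def missing_code(width):
    """All nines: 9 for a one-column field, 99 for a two-column field, etc."""
    return int("9" * width)


def format_record(values):
    """Build one fixed-field line; absent or None values get the missing code."""
    parts = []
    for name, width in WIDTHS.items():
        value = values.get(name)
        if value is None:
            value = missing_code(width)
        text = str(value).zfill(width)  # pad with leading zeros to the field width
        if len(text) > width:
            raise ValueError(f"{name}={value} does not fit in {width} column(s)")
        parts.append(text)
    return "".join(parts)


line = format_record({"id": 17, "division": 2, "yrsemp": None, "training": 3,
                      "jobclass": 1, "superv": 2, "sex": 1})
print(line)
```

Here the length of employment is unknown and the training preference was left out, so columns 5-6 receive 99 and column 12 receives 9.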
When you need to calculate the average length of employment for all employees in your survey, the computer will look at columns 5-6 of each record. It will take what it finds there and try to calculate an average. Therefore, it is important that all data on length of employment is in columns 5-6 of each record and that no other data is in columns 5-6. Missing data codes (i.e., values of “99”) will be excluded from the mean only if the program has been told to treat them as missing; otherwise they are averaged in as if they were real values.
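A short sketch shows why the missing data code matters for the mean. The three records are invented, and the code 99 in columns 5-6 is the missing data convention from the example.

```python
# Invented fixed-field records; columns 5-6 hold the length of employment.
records = ["001207032123", "002199011211", "003112102112"]
MISSING = 99  # missing data code for a two-column field


def tenure_values(lines):
    """Read columns 5-6 (0-based slice 4:6) from every record."""
    return [int(line[4:6]) for line in lines]


def mean_tenure(lines):
    """Mean length of employment, leaving out records coded as missing."""
    valid = [v for v in tenure_values(lines) if v != MISSING]
    return sum(valid) / len(valid) if valid else None


values = tenure_values(records)
naive_mean = sum(values) / len(values)  # wrongly treats 99 as 99 years

print("values read:", values)
print("mean with 99 treated as real:", round(naive_mean, 2))
print("mean with 99 excluded:", mean_tenure(records))
```

With the code excluded, the mean is based on the two employees who answered; with it included, a single missing answer pulls the average far above any real value.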
Many older computer programs are limited to a total of 80 columns of data per record. This is a holdover from the days when data was punched into cardboard cards and fed into card readers rather than entered directly into the computer. If your data requires more than 80 columns, you will need to build additional data matrices to record the rest of the information for each record.
Prepare instructions for data entry and data analysis
Data coding can be done directly on the data collection instrument and then transferred to data coding sheets or entered directly into the computer. It is important to prepare detailed instructions for data coding and data entry, especially if these tasks are shared or performed by several different people.
There are a number of statistical programs, spreadsheets, and databases that can be used for data entry. Most programs save the data and allow it to be output as a plain text or ASCII file, which is accepted by most statistical programs, such as SAS, SPSS, or Stata. Most of these programs are available in a desktop version, and many also come in cheaper student versions, such as Student Stata and Mystat.
There are also a number of stand-alone products, such as DataPerfect, that can be easily programmed to look like the data collection instrument, making data entry quite easy and eliminating the need to fill in a data entry matrix. These programs also have built-in safeguards, so that, for example, you can’t enter alphanumeric data into a variable that is for numeric data only. Also, the data is restricted to a limited number of columns, so four digits cannot be entered into a three-digit variable, etc.
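The two safeguards described above can be sketched in a few lines. This is not how DataPerfect or any particular product works; `check_entry` is a hypothetical function that only illustrates the idea of rejecting a value at entry time.

```python
def check_entry(text, width, numeric=True):
    """Return a list of problems with a typed value; an empty list means accepted."""
    problems = []
    if len(text) > width:
        problems.append(f"too long: {len(text)} characters, limit is {width}")
    # isascii() guards against non-ASCII digit characters that isdigit() accepts.
    if numeric and not (text.isascii() and text.isdigit()):
        problems.append("not numeric")
    return problems


print(check_entry("1234", 3))                 # four digits in a three-digit field
print(check_entry("12b", 3))                  # letters in a numeric field
print(check_entry("102b", 4, numeric=False))  # alphanumeric field, accepted
```

Checking values as they are typed catches these errors before they reach the data matrix, where they are much harder to trace back to the original questionnaire.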