HIS480 logo

USING SPREADSHEETS FOR HISTORICAL RESEARCH

Copyright 2006, 2011 by Dr. Jim Jones
of West Chester University


Go to the HIS480 Syllabus or Assignments.

INTRODUCTION

Although wordprocessing software works great for taking notes, other kinds of software are more helpful when it comes to analyzing data. In particular, numerical data that refers to money, dates or time periods may conceal patterns that are relevant to our study but invisible to the eye. For instance, the data from the Register of Deaths of Chester County uses numbers to indicate dates of birth and death, age at death, and the duration of each person's last illness. If you knew whether deaths were distributed evenly throughout all twelve months of the year, or if they tended to "bunch up" in certain months, you could determine if the weather was a factor in someone's death. Certainly, you could figure that out by going through the list by hand, but spreadsheet software can accomplish the same thing more quickly and accurately.

A spreadsheet organizes data into rows and columns. In the "Register of Deaths" spreadsheet, each row contains information about a single person while each column contains the one type of information about each person, such as theur last name, cause of death or marital status. For example, in this small spreadsheet, each of the five rows contains data about an individual, while the columns contain the name, year of death and place of death for each person.

Rupert Smith 1898 West Chester
Morris Hudson 1892 Downingtown
Elizabeth Grubb 1902 West Chester
Ann Hudson 1904 Embreeville
Elizabeth Brown 1899 Coatesville

If you want, you can add data to this spreadsheet by including information on more people (more rows) or by including more information about each person (more columns). The complete Register of Deaths spreadsheet contains twelve columns -- including gender, race, profession, place of birth, full date of death, cause of death, and duration of last illness -- and over 16,000 individual rows. [View a sample]

By placing your data in a spreadsheet, you can use software to manipulate it in many useful ways. For non-numerical data, the most common activity is to sort all of the data using one or more of the columns. To illustrate this, think about a telephone book. It displays data sorted by last name first, then by first name, middle initial, and street address. That makes it useful for finding someone whose name you already know, but it is not much help if you want to find who lives next door. For that, sorting the data by street name, followed by house number, produces a list that shows the houses on each street in numerical order; i.e. 122 E. Miner Street, 123 E. Miner Street, 124 E. Miner Street, etc.

Sorting data can answer the question in the first paragraph about whether Chester County deaths were "clustered" in any way. When the Register of Deaths data is sorted by the date each person died, it becomes easier to count the frequency of death in each of the twelve months. Between 1893 and 1907, the most deaths occurred in August-September, followed by January-April, and the lowest number occurred in June. Why? Pause for a moment to think about it, and then click on this link for some ideas.

Although a spreadsheet is useful for analyzing data, it does not always provide the best way to enter the data into your computer. If you are entering place names, months and other text, a wordprocessor will speed up the task and make it easier to find spelling errors. Fortunately, there are several ways to enter data into your computer using a wordprocessor and then move that data into a spreadsheet. This lesson describes two of them.

Both methods begin with typing your data into the computer in a systematic way using separators -- one special character to show the end of each line and a different special character to show where one piece of data ends and the next one begins. Although you could choose any two characters as your separators, the usual choice is a HARD-RETURN at the end of each line and a comma (,) between each pair of data items. Of course, that might cause problems if your data contains commas -- software will interpret Detroit, Michigan as two separate items -- but you can avoid that by either 1) removing commas from your data, 2) using another separator like a semi-colon (;), or 3) placing quotation remarks around every data item so that your data items will be separated by END-QUOTATION COMMA START-QUOTATION (",") and the software will ignore a single COMMA.

The simplest way to move data into a spreadsheet file is by copying the data from a Windows-based wordprocessing file and pasting it into a blank Excel spreadsheet. If your original file used the correct separators and EXCEL is configured correctly, your data will appear instantly as a spreadsheet (which you should immediately save). If things do not work out that smoothly, then a better (but more labor-intensive) way is to edit your data into "comma-delimited" format. Then you can save it and use Excel's IMPORT function to turn the saved file into a spreadsheet.

CREATING A COMMA-DELIMITED FILE

Before you try to convert your data, fix as many errors as you can. Error-free data is almost impossible, but computer software can help with error-checking. By using a spellchecker and SEARCH-AND-REPLACE, and by scanning through your data file for misplaced items, you can eliminate most errors. Here is a list of fixes that will correct many errors, assuming you have used consisent spelling and punctuation, and you used a comma (,) as your data separator. Note that order makes a difference -- you can't search for data items using COMMA SPACE until you've made certain that every comma is followed by a space.

  1. Use FIND-AND-REPLACE to remove any space directly in front of a comma.
  2. Use FIND-AND-REPLACE to make sure that there is exactly one space after every comma by 1) adding a space after every comma and then replacing COMMA SPACE SPACE with COMMA SPACE.
  3. Using FIND-AND-REPLACE, expand leftover abbreviations, add periods after middle initials and change data like "male" and "single" to all lower case.

    ... and if you're comfortable writing macros ...

  4. Write a macro to go through the entire document and verify that each had the correct number of data fields.
  5. Write a macro that copies the Nth data item from each line into a separate file and paste the copy of each data item onto a separate line. If the Nth data item in each line is a date in YYYY/MM/DD format, you can easily find missing dates, dates that are the wrong length, and lines with the wrong type of information in the date column, indicating that other fields in that entry may be out of place.
  6. Repeat the previous step for any other fields that lend themselves to visual inspection.

The following images show two entries from the Register of Deaths as they progress through all of the steps from note- taking to spreadsheet.

Here are the original handwritten notes
handwritten notes from the Register of
Deaths

 

Typed into the computer

p81
Forman,Susan,c,f,65,,,,1907/1/11,wpike,drops,2.5yrs
Forbes,Marguerita,w,f,56,h,,,1906/11/??,cv,cancer,4wks

Abbreviations expanded, page number copied, errors fixed

81, Forman, Susan, colored, female, 65, , , , 1907/01/11, West Pikeland, dropsey, 2 1/2 years
81, Forbes, Marguerita, white, female, 56, married, , , 1906/11/??, Coatesville, cancer, 4 weeks

Notice that I used the abbreviation "h" for married, instead of "m." Why? Because "m" and "f" were alreayd taken for male and female. That left "s," "h" and "w" for single, married and widowed.

The next step is to save your wordprocessing file in a format that Excel can IMPORT.

  1. Begin by saving your file, but use FILE-SAVE AS instead of the regular SAVE. Choose "Text Only" or "MS DOS Text" as your file type and give your file a name whose last three letters end in CSV, like NOTES.CSV. [Other programs may refer to the file type as "ASCII Text" or "Plain Text." Whatever the name, save it with the CSV ending so that Excel will be able to recognize it. If you plan to IMPORT it into another spreadsheet program (such as Quattro Pro or Works), you may need to save it with a TXT ending. Read your program's Help file and/or experiment with different name endings.
  2. Remember where you saved NOTES.CSV so you can find it from your spreadsheet program. Start your spreadsheet software (if it isn't already running) and locate its IMPORT command, or else use FILE-OPEN. Find NOTES.CSV, select the file type as "Text: *.TXT, *.CSV" and click on "OK." Your file should appear on your screen as a spreadsheet within seconds.

    The resulting spreadsheet

    81 Forman Susan colored female 65           1907/01/11 West Pikeland dropsey 2 1/2 years
    81 Forbes Marguerita white female 56 married         1906/11/?? Coatesville cancer 4 weeks

     

    As soon as your spreadsheet is ready, SAVE IT with a new name (like NOTES2). The spreadsheet program will automatically add the extension "XLS," so your file will end up with the name NOTES2.XLS .

  3. You need to add a bibliographic reference to your spreadsheet so that, if you use it in the future, you'll be able to cite the original data, and if a question ever arises about the accuracy of the spreadsheet, you can refer back to your original notes. Insert three rows at the top of the spreadsheet and enter a title and bibliographic reference for the data in cell A1. In the third row, type labels for each of the columns. From left to right, they are page#, last name, first name, race, gender, age-at-death, marital-status, occupation, place-of-birth, date-of-death, location-of-death, cause-of-death, and duration-of-last-illness.
  4. Continue to save your spreadsheet each time you make a significant change. If you have any reason to suspect that a change might have introduced errors, save you spreadsheet with a new name.

PROOFREADING YOUR SPREADSHEET

When your data was in a wordprocessing file, you used a special feature of that program, the spellchecker, to proofread your data. Besides using the spreadsheet's own error-checking features, you want to make sure that information for each characteristic wound up in the right column.

  1. Pick two columns (more, if you can focus) and set their width so that you can read their longest entries. Make the rest of the columns narrow so that they don't get in your way, and leave one new column wide enough to contain notes that you write as you proofread.
  2. Scan down through your spreadsheet and look for irregularities in the first two wide columns. You don't need to actually read the information; just stare at the screen while you "page down" through the data. Don't stop unless you notice something unusual in either of the columns, but if you do, examine that row to see if there is something you need to fix.
  3. As always, save your file as you work. Repeat this process for the other columns in the spreadsheet until you are satisfied with its accuracy.

ORGANIZING DATA WITH A SPREADSHEET

Although a spreadsheet is best at making calculations with numbers, it can organize text data as well. For instance, if you have ever made a list of telephone numbers to hang next to your phone, chances are that you organized it in rows and colums. The following three examples show how spreadsheets can be used to organize and examine different types of data.

Here is part of a spreadsheet that contains a list of European monarchs who reigned from roughly 1000 to 1900. The complete file gives the name of each monarch, the country where he or she ruled, the family name, the beginning and ending dates for the reign, the name of the successor and the monarch's relationship to his or her successor. The following subset lists the monarchs of Spain and shows how the data appears in a spreadsheet:

FAMILY NAME START-DATE END-DATE
Castile Henry IV 1454 1474
Castile Ferdinand 1469 1504
Castile Isabella 1474 1504
Hapsburg Charles I 1517 1556
Hapsburg Philip II 1556 1598
Hapsburg Philip III 1598 1621
Hapsburg Philip IV 1621 1640s?
Hapsburg Charles II 1665 1700
Bourbon Philip V 1700 1746
Bourbon Ferdinand I 1746 1759
Bourbon Charles III 1759 1788
Bonaparte Joseph 1808 1814
Bourbon Charles IV 1788 1819

By using different columns to sort the list, it is possible to extract different types of information from this list. For example, this list is already sorted by start-date -- from it, you can infer age and family relations (i.e. Philip II was Charles I's son). If you added a new column to contain the results of subtracting the start-date from the end-date , and then sorted the spreadsheet using that column, you could find periods of instability and stability when reigns were especially long or short.

Another advantage to storing data in a spreadsheet is the ease with which you can create space for and organize new information as you find it. For instance, you could add columns for the names of each monarch's spouse, their children, cause of death, number of major wars, or anything else that every monarch experienced. On the other hand, if you decided to focus on Charles I (1517-1556) -- known as Charles V after he became the Holy Roman Empire -- then it would not be worth adding columns to this spreadsheet, since it would create a lot of wasted space next the other names. You would be better served by starting a new spreadsheet devoted entirely to the life of Charles I.


Go to the HIS480 Syllabus or Assignments.

 

 

 

 

 

 

 

 

 

 

 

 

Why did more people die in Chester County at the end of the summer between 1893 and 1907? That's when the weather is hottest and driest, so water supplies would have been at their lowest. A look at causes of death confirms that water- borne disease was the main source of fataliies in late summer. The second most deadly period was the winter when temperature extremes and a shortage of fresh food weakened th elderly and infants. June was an especially healthy time because the weather was good, spring rains enabled cows to give more milk, and a variety of early-ripening crops (berries, greens, grains) became available for human consumption. Go back to the article.