Edit - I had to rebuild this post, and some of the original details of the target website used in the web scraping have since been changed by the site maintainer. Rather than redo all the work, this content has been rebuilt using print statements where the dynamic output no longer works. The original approach and thinking that I hoped to convey to the reader remain unchanged.
Introduction
This is the third post in a series that explores using R to collect the Commonwealth War Graves Commission (CWGC) dataset.
It’s a good time to note that data scraping is inherently trial and error, and the best way forward is often unclear. During this project I went down a number of dead ends, and you will invariably need to approach your particular problem from several angles before you find one that works or scales appropriately.
Partitioning the Problem
By the end of the last post, I had produced a breakdown of the number of war dead commemorated in each country, which I load here:
# Load data from previous post
load(file = "path/to/file/Summary_data_countries_commemorated.rda")
I previously discussed how the website’s spreadsheet download option permits a maximum of 80,000 rows, so results for any country with fewer than this number of graves can be downloaded with no further issues. Sorting the data and taking the top 10 values:
# Order data by number of records (descending)
cwgc_summary <- cwgc_summary[order(-cwgc_summary$sum),]
# Show the top 10 values
knitr::kable(head(cwgc_summary, 10), row.names = FALSE, caption = "Number of Records by Country and War")
Table: Table 1: Number of Records by Country and War
| commemorated                 |  war-1 |  war-2 |    sum |
|------------------------------|-------:|-------:|-------:|
| France                       | 538841 |  46260 | 585101 |
| United Kingdom               | 134962 | 183130 | 318092 |
| Belgium                      | 197351 |  10397 | 207748 |
| Civilian War Dead            |      0 |  68887 |  68887 |
| India                        |  18788 |  43342 |  62130 |
| Iraq                         |  51427 |   3002 |  54429 |
| Italy                        |   3809 |  45555 |  49364 |
| Egypt                        |  14908 |  28568 |  43476 |
| Myanmar                      |    112 |  38388 |  38500 |
| Turkey (including Gallipoli) |  36800 |     30 |  36830 |
Only France, the United Kingdom and Belgium exceeded the threshold of 80,000 records, and it was simple to develop a loop to download the graves data for the other countries below this threshold. The website does not support a search across multiple countries, so the loop downloaded one spreadsheet per country. I also had to remove the ‘Civilian War Dead’ category, since it could not be downloaded with the other countries using the same method.
# Filter data frame for countries with fewer than 80,000 records
cwgc_set_1 <- cwgc_summary[cwgc_summary$sum < 80000,]
# Discard countries with zero graves
cwgc_set_1 <- cwgc_set_1[cwgc_set_1$sum != 0,]
# Get vector of countries
cwgc_set_1 <- as.character(cwgc_set_1$commemorated)
# Remove 'Civilian War Dead' category
cwgc_set_1 <- cwgc_set_1[2:length(cwgc_set_1)]
I denoted Set 1 as the list of 141 countries whose results each fit into a single spreadsheet download, and Set 2 as the three countries with more than 80,000 records. Data for the countries in Set 2 had to be split up in some manner before downloading. These two sets required different approaches and will be dealt with separately in the next sections.
# Filter data frame for countries with an individual sum > 80,000
cwgc_set_2 <- cwgc_summary[cwgc_summary$sum > 80000,]
# Get vector of countries
cwgc_set_2 <- as.character(cwgc_set_2$commemorated)
Show Set 1
## [1] "India"
## [2] "Iraq"
## [3] "Italy"
## [4] "Egypt"
## [5] "Myanmar"
## [6] "Turkey (including Gallipoli)"
## [7] "Germany"
## [8] "Singapore"
## [9] "Netherlands"
## [10] "Canada"
## [11] "Greece"
## [12] "Israel and Palestine (including Gaza)"
## [13] "Australia"
## [14] "South Africa"
## [15] "Tunisia"
## [16] "Thailand"
## [17] "Papua New Guinea"
## [18] "Kenya"
## [19] "Libya"
## [20] "Malta"
## [21] "Tanzania"
## [22] "Nigeria"
## [23] "Malaysia"
## [24] "Iran"
## [25] "New Zealand"
## [26] "Indonesia"
## [27] "Algeria"
## [28] "Sri Lanka"
## [29] "Sierra Leone"
## [30] "Ghana"
## [31] "Japan"
## [32] "Lebanese Republic"
## [33] "Sudan"
## [34] "Pakistan"
## [35] "Bangladesh"
## [36] "Syria"
## [37] "Poland"
## [38] "United States of America"
## [39] "Denmark"
## [40] "Lesotho"
## [41] "Eritrea"
## [42] "Norway"
## [43] "Gibraltar"
## [44] "New Caledonia"
## [45] "Zimbabwe"
## [46] "Namibia"
## [47] "Russian Federation"
## [48] "Somalia (including Somaliland)"
## [49] "Austria"
## [50] "Serbia"
## [51] "Uganda"
## [52] "Ethiopia"
## [53] "Yemen"
## [54] "Cyprus"
## [55] "Seychelles"
## [56] "Malawi"
## [57] "Madagascar"
## [58] "Zambia"
## [59] "Mozambique"
## [60] "Gambia"
## [61] "Jamaica"
## [62] "Bulgaria"
## [63] "Iceland"
## [64] "Czech Republic"
## [65] "Hungary"
## [66] "Romania"
## [67] "Botswana"
## [68] "Sweden"
## [69] "Trinidad and Tobago"
## [70] "Bermuda"
## [71] "Switzerland"
## [72] "Spain"
## [73] "Fiji"
## [74] "Mauritius"
## [75] "Eswatini"
## [76] "Maldives"
## [77] "Georgia"
## [78] "Morocco"
## [79] "Bahamas"
## [80] "Faroe Islands"
## [81] "Azores"
## [82] "Portugal"
## [83] "Belize"
## [84] "Albania"
## [85] "Azerbaijan"
## [86] "Cameroon"
## [87] "St. Lucia"
## [88] "Falkland Islands"
## [89] "Latvia"
## [90] "Congo (Democratic Republic)"
## [91] "Congo"
## [92] "Luxembourg"
## [93] "Barbados"
## [94] "Brazil"
## [95] "Guyana"
## [96] "Senegal"
## [97] "St. Helena and Ascension Island"
## [98] "St. Vincent"
## [99] "Argentina"
## [100] "Estonia"
## [101] "Panama"
## [102] "Djibouti"
## [103] "Liberia"
## [104] "Netherlands Antilles"
## [105] "Samoa"
## [106] "Equatorial Guinea"
## [107] "British Indian Ocean Territory"
## [108] "Cape Verde"
## [109] "Uruguay"
## [110] "Madeira"
## [111] "Peru"
## [112] "Chile"
## [113] "Mauritania"
## [114] "Norfolk Island"
## [115] "Solomon Islands"
## [116] "Chad"
## [117] "Grenada"
## [118] "Society Islands"
## [119] "Cuba"
## [120] "Dominica"
## [121] "Mali"
## [122] "Puerto Rico"
## [123] "Tonga"
## [124] "Antigua"
## [125] "Croatia"
## [126] "Guinea"
## [127] "Lithuania"
## [128] "Monaco"
## [129] "Oman"
## [130] "Philippines"
## [131] "San Marino"
## [132] "St. Christopher and Nevis"
## [133] "Ukraine"
## [134] "Vanuatu"
## [135] "Costa Rica"
## [136] "Guatemala"
## [137] "Honduras"
## [138] "Martinique"
## [139] "Nepal"
## [140] "Saudi Arabia"
## [141] "Togo"
Show Set 2
## [1] "France" "United Kingdom" "Belgium"
Downloading Set 1
Using a process similar to the one I covered previously, I downloaded the spreadsheets by looping through all the countries in Set 1. First I defined the URL, set up an HTML session and retrieved the search form fields. I started out intending to use the html_form() function, but I quickly ran into issues trying to set the parameters.
# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"
# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)
# Parse form fields from URL
page_form <- html_form(pgsession)[[3]]
When I submitted the form for a single country, I wasn’t getting the expected response back from the website. Because I wanted the results from both wars (WWI and WWII), I simply omitted the ‘War’ variable, with the (incorrect) assumption that this would include results from both wars. However, it seems the filled form defaulted to searching for something else altogether and my failure to set the form values correctly resulted in the unexpected output.
# Set form fields with country to be queried
# Not working as expected
filled_form <- set_values(page_form,
"countryCommemoratedAt" = 'India')
Inspecting the form’s HTML elements in Chrome showed that the two check-boxes that set the war had the same id, and hunting around on the Internet revealed that the rvest package cannot handle multiple form fields with the same name. I struggled to use the submit_form() function, and my efforts to set the check-box form fields correctly were unsuccessful. This appears to be a known issue with rvest, but there is zero appetite to resolve it, considering the package has a git branch with a fix that has been waiting for over three years. I should have used Selenium! [Edit: since I originally worked on this project, the CWGC website has been updated and the elements now have different ids]
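For reference, here is a minimal sketch of the kind of check that exposes the problem from within R. It assumes the page_form object parsed above; the field names themselves are whatever the site used at the time, so treat this as illustrative only.

# Illustrative sketch (not part of the original workflow): list the parsed
# form's field names and flag any that appear more than once
field_names <- names(page_form$fields)
print(field_names)
# Duplicated names are the ones that set_values()/submit_form() trip over
print(field_names[duplicated(field_names)])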
All was not lost, however: a simple alternative is to generate the request URLs manually, based on the string pattern produced by a manual search. Frankly, this approach would have saved me a whole lot of heartache and time. A quick search on the website showed the URL pattern to emulate:
# URL pattern to reproduce
# https://www.cwgc.org/find/find-war-dead/results?country=Germany
# Construct URL
baseurl <- "https://www.cwgc.org/find/find-war-dead/results?country="
mainurl <- paste(baseurl, country, sep = "")
mainurl <- gsub(" ", "%2B", mainurl) # Replace spaces if present
The next step was to retrieve the hyperlink for downloading the spreadsheet. I did a manual search on the website, setting the Country drop-down to a random country and leaving the remaining search fields blank. Once this had run, I used Chrome DevTools to copy the XPath of the ‘Download Results’ button and retrieved the corresponding href attribute value.
# Read HTML contents
dat <- read_html(mainurl)
# Get contents of HTML element
dat_html <- html_nodes(dat,
xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/a")
# Get attribute
dat_attr <- html_attr(dat_html, "href")
The attribute is a relative hyperlink, so I had to stitch on the preceding part of the URL:
# Concatenate URL and relative URL
download_address <- paste0("https://www.cwgc.org", dat_attr)
Finally, I downloaded the file using the download.file() function, which unsurprisingly downloads a file from a URL. The file is saved to the current working directory, so if in any doubt use getwd() to check where you are.
# Download file from URL - the curl method removes blank rows in the output
download.file(download_address, method = "curl", "India.csv")
At this point I had figured out the mechanics of downloading a single country’s casualties and I simply wrapped the preceding code snippets in a loop to step through the countries in Set 1. I also created a sub-directory called ‘Set1’ for the results.
# Set path for downloaded files
path <- "./Set1/"
# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"
# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)
# Loop to step through countries
for (country in cwgc_set_1){
# Console output
cat("Working on", country, "\n")
# Construct URL
baseurl <- "https://www.cwgc.org/find/find-war-dead/results?country="
mainurl <- paste(baseurl, country, sep = "")
mainurl <- gsub(" ", "%2B", mainurl)
# Read HTML contents
dat <- read_html(mainurl)
# Get contents of HTML element with XPath option
dat_html <- html_nodes(dat,
xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/div[2]/a")
# Get href attribute
dat_attr <- html_attr(dat_html, "href")
# Concatenate URL and relative URL
download_address <- paste0("https://www.cwgc.org", dat_attr)
# Set up file name for each country
file_name <- paste0(path, country, ".csv")
# Console output
cat("Downloading", country, "\n")
# Download file from URL
# This function's method parameter depends on your OS, so pick accordingly
download.file(download_address, method = "curl", file_name)
# Wait between 100 and 200 seconds before next page request
Sys.sleep(sample(seq(100, 200, 1), 1))
}
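The loop above assumes every request succeeds. As an aside (not part of the original code), the download step could be wrapped in tryCatch() so that a single failed request is logged rather than halting the whole run:

# Illustrative sketch: log a failed download and carry on with the loop
result <- tryCatch(
  download.file(download_address, method = "curl", file_name),
  error = function(e) {
    cat("Download failed for", country, ":", conditionMessage(e), "\n")
    NA
  }
)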
Once this was done, I downloaded the ‘Civilian War Dead’ category manually, since coding it was more effort than it was worth.
Merging Set 1
The final task was to read in all the *.csv files and save them into a *.rda object.
# Set path for downloaded files
path <- "./Set1/"
# List of *.csv files
csv_files <- list.files(path, pattern = "\\.csv$")
# Initialise object for output
out_file <- c()
# Loop through each file
for (my_file in csv_files){
# Set up file name
file_name <- paste0(path, my_file)
# Read in file
dat <- read.csv(file_name, header = TRUE, sep = ",")
# Console output for progress indicator
cat(my_file, " \n")
# Bind results
out_file <- rbind(out_file, dat)
}
# Save output to file
save(out_file, file = "./path/to/file/Set_1_cwgc_graves.rda")
A quick check showed that the number of rows in the output data frame matched the expected number of records for all countries in Set 1.
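A sketch of the kind of check I mean (not the original code), assuming cwgc_summary and cwgc_set_1 are still in the workspace; include ‘Civilian War Dead’ in the expected total if its manually downloaded CSV was also saved into the Set1 folder.

# Sketch: compare downloaded rows against the summary totals for Set 1
expected <- sum(cwgc_summary$sum[cwgc_summary$commemorated %in% cwgc_set_1])
cat("Expected rows:", expected, "- downloaded rows:", nrow(out_file), "\n")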
Downloading Set 2
Set 2 consists of the United Kingdom, France and Belgium, and the challenge was to search the website in a way that produced results with no more than 80,000 records each. First, let’s reassess the number of records in Set 2.
# Graves per country in Set 2
knitr::kable(cwgc_summary[cwgc_summary$commemorated %in% cwgc_set_2,],
row.names = FALSE, caption = "Records in Set 2 Countries")
Table: Table 2: Records in Set 2 Countries
| commemorated   |  war-1 |  war-2 |    sum |
|----------------|-------:|-------:|-------:|
| France         | 538841 |  46260 | 585101 |
| United Kingdom | 134962 | 183130 | 318092 |
| Belgium        | 197351 |  10397 | 207748 |
The easiest way to proceed was to split these countries’ war dead into subsets using the war and date-of-death ranges. I had to spend a little time manually searching to figure out the date splits, since I didn’t want to do it programmatically considering the potential impact on the website. There were four chunks of data that needed to be split: France WWI, Belgium WWI, UK WWI and UK WWII. I summarised the date splits in a spreadsheet and imported the dates (a small sanity check on these ranges is sketched after the table below).
# Read in date splits
dates <- read.csv(file = "Dates.csv",
sep = ",",
header = FALSE,
stringsAsFactors = FALSE)
# Add column names
colnames(dates) <- c("Country", "Start", "End")
# Show table
knitr::kable(dates, row.names = FALSE, caption = "Set 2 Date Splits")
Table: Table 3: Set 2 Date Splits
| Country        | Start      | End        |
|----------------|------------|------------|
| France         | 01/07/1914 | 31/12/1914 |
| France         | 01/01/1915 | 30/06/1915 |
| France         | 01/07/1915 | 31/12/1915 |
| France         | 01/01/1916 | 30/06/1916 |
| France         | 01/07/1916 | 31/08/1916 |
| France         | 01/09/1916 | 31/12/1916 |
| France         | 01/01/1917 | 30/04/1917 |
| France         | 01/05/1917 | 31/08/1917 |
| France         | 01/09/1917 | 31/12/1917 |
| France         | 01/01/1918 | 30/06/1918 |
| France         | 01/07/1918 | 30/09/1918 |
| France         | 01/10/1918 | 31/12/1918 |
| France         | 01/01/1919 | 31/12/1920 |
| France         | 01/07/1939 | 31/12/1947 |
| United Kingdom | 01/07/1914 | 31/12/1917 |
| United Kingdom | 01/01/1918 | 31/12/1921 |
| Belgium        | 01/07/1914 | 30/06/1916 |
| Belgium        | 01/07/1916 | 30/09/1917 |
| Belgium        | 01/10/1917 | 31/12/1921 |
| Belgium        | 01/07/1939 | 31/12/1947 |
| United Kingdom | 01/07/1939 | 31/12/1941 |
| United Kingdom | 01/01/1942 | 31/12/1943 |
| United Kingdom | 01/01/1944 | 31/12/1947 |
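Before running the downloads, it is worth confirming that these ranges behave as intended. Here is the small sanity check mentioned above (my addition, not part of the original post): it verifies that the date ranges for each country do not overlap, using the dates data frame read in earlier.

# Sketch: check that no date ranges overlap within a country
dates$Start_d <- as.Date(dates$Start, format = "%d/%m/%Y")
dates$End_d <- as.Date(dates$End, format = "%d/%m/%Y")
for (ctry in unique(dates$Country)) {
  sub <- dates[dates$Country == ctry, ]
  sub <- sub[order(sub$Start_d), ]
  # TRUE if any range starts on or before the previous range's end date
  overlap <- any(sub$Start_d[-1] <= sub$End_d[-nrow(sub)])
  cat(ctry, "- overlapping ranges:", overlap, "\n")
}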
I wanted to use the submit_form() method with the website form but could not get it to work, so I generated the request URLs manually. A random search using the Date of Death option showed the structure of the URL to replicate. The loop draws on code discussed previously, so it should be familiar enough. The files were downloaded into a directory called ‘Set2’.
# URL structure to replicate
#.../find-war-dead/results?country=Belgium&dateFrom=01-01-1915&dateTo=01-01-1916
# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"
# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)
# Read contents of webpage into XML document
webpage <- read_html(pgsession)
# Loop through dates and country
for(i in 1:nrow(dates)){
# Get country
country <- dates$Country[i]
# Replace spaces with '+' in country name
country <- gsub(" ", "+", country)
# Get start date
start_date <- dates$Start[i]
# Change '/' to '-'
start_date <- gsub("/", "-", start_date)
# Get end date
end_date <- dates$End[i]
# Change '/' to '-'
end_date <- gsub("/", "-", end_date)
# Set path
path <- "./Set2/"
# Set up file name
file_name <- paste0(path, country, start_date, end_date, ".csv")
# Construct URL
part_1 <- "https://www.cwgc.org/find/find-war-dead/results?"
part_2 <- paste0("country=", country)
part_3 <- paste0("&dateFrom=", start_date)
part_4 <- paste0("&dateTo=", end_date)
request_URL <- paste0(part_1, part_2, part_3, part_4)
# Get contents of webpage
webpage <- read_html(request_URL)
# Get 'Download Results' element
dat_html <- html_nodes(webpage,
xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/a")
# Get href attribute
dat_attr <- html_attr(dat_html, "href")
# Concatenate URL and relative URL
download_address <- paste0("https://www.cwgc.org", dat_attr)
# Download file from URL
download.file(download_address, file_name)
# Wait between 30 and 60 seconds before next page request
Sys.sleep(sample(seq(30, 60, 1), 1))
}
Merging Set 2
As before, we can read in all the *.csv files, merge them and save to a *.rda object.
# Set path for downloaded files
path <- "./Set2/"
# List of *.csv files
csv_files <- list.files(path, pattern = "\\.csv$")
# Initialise object for output
out_file <- c()
# Loop through each file
for (my_file in csv_files){
# Set up file name
file_name <- paste0(path, my_file)
# Read in file
dat <- read.csv(file_name, header = TRUE, sep = ",")
# Console output for progress indicator
cat(my_file, " \n")
# Bind results
out_file <- rbind(out_file, dat)
}
# Save output to file
save(out_file, file = "./path/to/file/Set_2_cwgc_graves.rda")
I checked whether the number of rows in the output data frame equalled the expected number of records for all countries in Set 2. The numbers did not match precisely; the totals were out by ~280. When I dug into the cause, I found that some additional records had been added to the CWGC database during my analysis, so my summary data no longer matched the data downloaded. In addition, there are graves with no date of death, belonging to German and Russian soldiers, and the searches that used a date of death do not pick up these records. These graves more or less accounted for the difference, so I was satisfied that the numbers reconciled.
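For completeness, a sketch of that comparison (my addition, assuming cwgc_summary and cwgc_set_2 are still loaded); the ~280 discrepancy discussed above would show up as the difference between the two totals.

# Sketch: compare downloaded rows against the summary totals for Set 2
expected <- sum(cwgc_summary$sum[cwgc_summary$commemorated %in% cwgc_set_2])
cat("Expected rows:", expected, "\n")
cat("Downloaded rows:", nrow(out_file), "\n")
cat("Difference:", expected - nrow(out_file), "\n")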
Combining the Results
The last step was to combine the data from Set 1 and Set 2.
# Load data from file
load(file = "./path/to/file/Set_1_cwgc_graves.rda")
set1 <- out_file
# Load data from file
load(file = "./path/to/file/Set_2_cwgc_graves.rda")
set2 <- out_file
# Row bind
cwgc_graves <- rbind(set1, set2)
# Save output to file
save(cwgc_graves, file = "./path/to/file/All_cwgc_graves.rda")
Conclusion
This post has demonstrated how to automatically search for and download the 1.7 million grave records from the CWGC website.