Edit - I had to rebuild this post, and some of the details on the target website used in the original web scraping have since been changed by the site maintainer. Rather than redo all the work, this content has been rebuilt using print statements where the dynamic output no longer works. The original approach and thinking that I hoped to convey to the reader remain.
Introduction
This is the third post in a series that explores data collection using R to collect the Commonwealth War Graves Commission (CWGC) dataset.
It’s a good time to note that data scraping is inherently trial and error, and the best way forward is often not clear. During the course of this project I went down a number of dead-ends, so you will invariably need to approach your particular problem from several angles before you find one that works or scales appropriately.
Partitioning the Problem
By the end of the last post, I had produced a breakdown of the number of war dead commemorated in each country, which I load here:
```r
# Load data from previous post
load(file="path/to/file/Summary_data_countries_commemorated.rda")
```
I previously discussed how the website’s spreadsheet download option only permits a maximum of 80,000 rows, so results for any country with fewer than this number of graves can be downloaded with no further issues. Sorting the data and taking the top 10 values:
```r
# Order data by number of records (descending)
cwgc_summary<-cwgc_summary[order(-cwgc_summary$sum),]

# Show the top 10 values
knitr::kable(head(cwgc_summary,10),row.names=FALSE,
             caption="Number of Records by Country and War")
```
Table: Table 1: Number of Records by Country and War

| commemorated                 |  war-1 |  war-2 |    sum |
|:-----------------------------|-------:|-------:|-------:|
| France                       | 538841 |  46260 | 585101 |
| United Kingdom               | 134962 | 183130 | 318092 |
| Belgium                      | 197351 |  10397 | 207748 |
| Civilian War Dead            |      0 |  68887 |  68887 |
| India                        |  18788 |  43342 |  62130 |
| Iraq                         |  51427 |   3002 |  54429 |
| Italy                        |   3809 |  45555 |  49364 |
| Egypt                        |  14908 |  28568 |  43476 |
| Myanmar                      |    112 |  38388 |  38500 |
| Turkey (including Gallipoli) |  36800 |     30 |  36830 |
Only France, the United Kingdom and Belgium exceeded the 80,000-record threshold, so it was simple to develop a loop to download the graves data for the remaining countries. The website does not support a search across multiple countries, so the loop downloaded one spreadsheet per country. I also had to remove the ‘Civilian War Dead’ category, since it could not be downloaded with the other countries using the same method.
```r
# Filter data frame for countries with fewer than 80,000 records
cwgc_set_1<-cwgc_summary[cwgc_summary$sum<80000,]

# Discard countries with zero graves
cwgc_set_1<-cwgc_set_1[cwgc_set_1$sum!=0,]

# Get vector of countries
cwgc_set_1<-as.character(cwgc_set_1$commemorated)

# Remove 'Civilian War Dead' category (the first entry after sorting)
cwgc_set_1<-cwgc_set_1[2:length(cwgc_set_1)]
```
I denoted Set 1 as the list of 144 countries whose results each fit into a single spreadsheet download, and Set 2 as the three countries with more than 80,000 records. Data for the countries in Set 2 had to be split up in some manner before downloading. The two sets required different approaches and are dealt with separately in the following sections.
```r
# Filter data frame for countries with an individual sum > 80,000
cwgc_set_2<-cwgc_summary[cwgc_summary$sum>80000,]

# Get vector of countries
cwgc_set_2<-as.character(cwgc_set_2$commemorated)
```
Using a process similar to what I covered previously, I downloaded the spreadsheets by looping through all the sites in Set 1. First I defined the URL, set up an HTML session and retrieved the search form fields. I started out intending to use the html_form() function, but I quickly ran into some issues trying to set the parameters.
```r
# URL to be scraped
war_dead_url<-"https://www.cwgc.org/find/find-war-dead"

# Set up HTML session for scraping
pgsession<-html_session(war_dead_url)

# Parse form fields from URL
page_form<-html_form(pgsession)[[3]]
```
When I submitted the form for a single country, I wasn’t getting the expected response back from the website. Because I wanted the results from both wars (WWI and WWII), I simply omitted the ‘War’ variable, with the (incorrect) assumption that this would include results from both wars. However, it seems the filled form defaulted to searching for something else altogether and my failure to set the form values correctly resulted in the unexpected output.
```r
# Set form fields with country to be queried
# Not working as expected
filled_form<-set_values(page_form,"countryCommemoratedAt"='India')
```
Inspecting the form’s HTML elements in Chrome showed that the two check-boxes for setting the War had the same Id, and hunting around on the Internet revealed that the rvest package cannot handle multiple form fields with the same name. I struggled with the submit_form() function and my efforts to set the check-box form fields correctly were unsuccessful. This appears to be a known issue with rvest, but there seems to be little appetite to resolve it, given that a git branch with a fix has been waiting for over three years. I should have used Selenium! [Edit: since I originally worked on this project, I see that the CWGC website has been updated and the elements now have different Ids]
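If you want to check whether a parsed form suffers from the same problem, here is a minimal sketch, assuming the pre-1.0 rvest form object where the parsed fields sit in page_form$fields:

```r
# Sketch only: list every field name in the parsed form
# (assumes the older rvest API, where page_form$fields is a list of inputs)
field_names<-vapply(page_form$fields,function(f) f$name,character(1))

# Any duplicated names are the fields rvest cannot set unambiguously
field_names[duplicated(field_names)]
```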
All was not lost however - a simple alternative is to manually generate the request URLs based on the string pattern generated by a manual search. Frankly, this approach would have saved me a whole lot of heartache and time. A quick search on the website showed the URL pattern to emulate:
```r
# URL pattern to reproduce
# https://www.cwgc.org/find/find-war-dead/results?country=Germany

# Construct URL
baseurl<-"https://www.cwgc.org/find/find-war-dead/results?country="
mainurl<-paste(baseurl,country,sep="")
mainurl<-gsub(" ","%2B",mainurl) # Replace spaces if present
```
The next step was to retrieve the hyperlink for downloading the spreadsheet. I did a manual search on the website by setting the Country drop-down to a random country and left the remaining search fields blank. Once this had run, I used Chrome Dev tools to copy the XPath from the ‘Download Results’ button and retrieved the corresponding href attribute value.
Figure 1. Form search fields and names
```r
# Read HTML contents
dat<-read_html(mainurl)

# Get contents of HTML element
dat_html<-html_nodes(dat,xpath="/html/body/main/section[2]/div/div[1]/div/div/div/div/a")

# Get attribute
dat_attr<-html_attr(dat_html,"href")
```
The attribute is a relative hyperlink, so I had to stitch on the preceding part of the URL:
```r
# Concatenate URL and relative URL
download_address<-paste0("https://www.cwgc.org",dat_attr)
```
Finally, I downloaded the file using the download.file() function, which unsurprisingly downloads a file using a URL. The file is downloaded to the current working directory - if in any doubt use getwd() to check where you are.
```r
# Download file from URL - the curl method removes blank rows in the output
download.file(download_address,method="curl","India.csv")
```
At this point I had figured out the mechanics of downloading a single country’s casualties and I simply wrapped the preceding code snippets in a loop to step through the countries in Set 1. I also created a sub-directory called ‘Set1’ for the results.
```r
# Set path for downloaded files
path<-"./Set1/"

# URL to be scraped
war_dead_url<-"https://www.cwgc.org/find/find-war-dead"

# Set up HTML session for scraping
pgsession<-html_session(war_dead_url)

# Loop to step through countries
for(country in cwgc_set_1){

  # Console output
  cat("Working on",country,"\n")

  # Construct URL
  baseurl<-"https://www.cwgc.org/find/find-war-dead/results?country="
  mainurl<-paste(baseurl,country,sep="")
  mainurl<-gsub(" ","%2B",mainurl)

  # Read HTML contents
  dat<-read_html(mainurl)

  # Get contents of HTML element with XPath option
  dat_html<-html_nodes(dat,xpath="/html/body/main/section[2]/div/div[1]/div/div/div/div/div[2]/a")

  # Get href attribute
  dat_attr<-html_attr(dat_html,"href")

  # Concatenate URL and relative URL
  download_address<-paste0("https://www.cwgc.org",dat_attr)

  # Set up file name for each country
  file_name<-paste0(path,country,".csv")

  # Console output
  cat("Downloading",country,"\n")

  # Download file from URL
  # This function's method parameter depends on your OS, so pick accordingly
  download.file(download_address,method="curl",file_name)

  # Wait between 100 and 200 seconds before next page request
  Sys.sleep(sample(seq(100,200,1),1))
}
```
Once this was done, I downloaded the ‘Civilian War Dead’ category manually, since coding it was more effort than it was worth.
Merging Set 1
The final task was to read in all the *.csv files and save them into a *.rda object.
```r
# Set path for downloaded files
path<-"./Set1/"

# List of *.csv files
csv_files<-list.files(path,pattern="\\.csv$")

# Initialise object for output
out_file<-c()

# Loop through each file
for(my_file in csv_files){

  # Set up file name
  file_name<-paste0(path,my_file)

  # Read in file
  dat<-read.csv(file_name,header=TRUE,sep=",")

  # Console output for progress indicator
  cat(my_file," \n")

  # Bind results
  out_file<-rbind(out_file,dat)
}

# Save output to file
save(out_file,file="./path/to/file/Set_1_cwgc_graves.rda")
```
A quick check showed that the number of rows in the output data frame matched the expected number of records for all countries in Set 1.
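For reference, a small sketch of the kind of check I mean, assuming the manually downloaded ‘Civilian War Dead’ file also sits in the Set1 directory and so contributes its 68,887 records to the total:

```r
# Expected total: all Set 1 countries plus the 'Civilian War Dead' category
expected_set_1<-sum(cwgc_summary$sum[cwgc_summary$commemorated %in% cwgc_set_1])+68887

# Compare against the rows actually read in
nrow(out_file)==expected_set_1
```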
Downloading Set 2
Set 2 consists of the United Kingdom, France and Belgium, and the challenge was to construct searches that each return no more than 80,000 records. First, let’s recap the number of records in Set 2.
```r
# Graves per country in Set 2
knitr::kable(cwgc_summary[cwgc_summary$commemorated %in% cwgc_set_2,],
             row.names=FALSE,caption="Records in Set 2 Countries")
```
Table: Table 2: Records in Set 2 Countries

| commemorated   |  war-1 |  war-2 |    sum |
|:---------------|-------:|-------:|-------:|
| France         | 538841 |  46260 | 585101 |
| United Kingdom | 134962 | 183130 | 318092 |
| Belgium        | 197351 |  10397 | 207748 |
The easiest way to proceed was to split these countries’ war dead into chunks using the war and date-of-death ranges. I had to spend a little time manually searching to figure out the date splits, since I didn’t want to do it programmatically given the potential impact on the website. There were four blocks of data that needed to be split: France WWI, Belgium WWI, UK WWI and UK WWII. I summarised the date splits in a spreadsheet and imported the dates.
```r
# Read in date splits
dates<-read.csv(file="Dates.csv",sep=",",header=FALSE,stringsAsFactors=FALSE)

# Add column names
colnames(dates)<-c("Country","Start","End")

# Show table
knitr::kable(dates,row.names=FALSE,caption="Set 2 Date Splits")
```
Table: Table 3: Set 2 Date Splits

| Country        | Start      | End        |
|:---------------|:-----------|:-----------|
| France         | 01/07/1914 | 31/12/1914 |
| France         | 01/01/1915 | 30/06/1915 |
| France         | 01/07/1915 | 31/12/1915 |
| France         | 01/01/1916 | 30/06/1916 |
| France         | 01/07/1916 | 31/08/1916 |
| France         | 01/09/1916 | 31/12/1916 |
| France         | 01/01/1917 | 30/04/1917 |
| France         | 01/05/1917 | 31/08/1917 |
| France         | 01/09/1917 | 31/12/1917 |
| France         | 01/01/1918 | 30/06/1918 |
| France         | 01/07/1918 | 30/09/1918 |
| France         | 01/10/1918 | 31/12/1918 |
| France         | 01/01/1919 | 31/12/1920 |
| France         | 01/07/1939 | 31/12/1947 |
| United Kingdom | 01/07/1914 | 31/12/1917 |
| United Kingdom | 01/01/1918 | 31/12/1921 |
| Belgium        | 01/07/1914 | 30/06/1916 |
| Belgium        | 01/07/1916 | 30/09/1917 |
| Belgium        | 01/10/1917 | 31/12/1921 |
| Belgium        | 01/07/1939 | 31/12/1947 |
| United Kingdom | 01/07/1939 | 31/12/1941 |
| United Kingdom | 01/01/1942 | 31/12/1943 |
| United Kingdom | 01/01/1944 | 31/12/1947 |
I wanted to use the submit_form() method with the website form but could not get it to work, so I generated the request URLs manually. Doing a random search using the Date of Death option showed the structure of the URL to replicate. The loop draws on code that we have discussed previously, so should be familiar enough. The files were downloaded into a directory called ‘Set2’.
```r
# URL structure to replicate
# .../find-war-dead/results?country=Belgium&dateFrom=01-01-1915&dateTo=01-01-1916

# URL to be scraped
war_dead_url<-"https://www.cwgc.org/find/find-war-dead"

# Set up HTML session for scraping
pgsession<-html_session(war_dead_url)

# Read contents of webpage into XML document
webpage<-read_html(pgsession)

# Loop through dates and country
for(i in 1:nrow(dates)){

  # Get country
  country<-dates$Country[i]

  # Replace spaces with '+' in country name
  country<-gsub(" ","+",country)

  # Get start date
  start_date<-dates$Start[i]

  # Change '/' to '-'
  start_date<-gsub("/","-",start_date)

  # Get end date
  end_date<-dates$End[i]

  # Change '/' to '-'
  end_date<-gsub("/","-",end_date)

  # Set path
  path<-"./Set2/"

  # Set up file name
  file_name<-paste0(path,country,start_date,end_date,".csv")

  # Construct URL
  part_1<-"https://www.cwgc.org/find/find-war-dead/results?"
  part_2<-paste0("country=",country)
  part_3<-paste0("&dateFrom=",start_date)
  part_4<-paste0("&dateTo=",end_date)
  request_URL<-paste0(part_1,part_2,part_3,part_4)

  # Get contents of webpage
  webpage<-read_html(request_URL)

  # Get 'Download Results' element
  dat_html<-html_nodes(webpage,xpath="/html/body/main/section[2]/div/div[1]/div/div/div/div/a")

  # Get href attribute
  dat_attr<-html_attr(dat_html,"href")

  # Concatenate URL and relative URL
  download_address<-paste0("https://www.cwgc.org",dat_attr)

  # Download file from URL
  download.file(download_address,file_name)

  # Wait between 30 and 60 seconds before next page request
  Sys.sleep(sample(seq(30,60,1),1))
}
```
Merging Set 2
As before, we can read in all the *.csv files, merge them and save to a *.rda object.
```r
# Set path for downloaded files
path<-"./Set2/"

# List of *.csv files
csv_files<-list.files(path,pattern="\\.csv$")

# Initialise object for output
out_file<-c()

# Loop through each file
for(my_file in csv_files){

  # Set up file name
  file_name<-paste0(path,my_file)

  # Read in file
  dat<-read.csv(file_name,header=TRUE,sep=",")

  # Console output for progress indicator
  cat(my_file," \n")

  # Bind results
  out_file<-rbind(out_file,dat)
}

# Save output to file
save(out_file,file="./path/to/file/Set_2_cwgc_graves.rda")
```
I checked that the number of rows in the output data frame equalled the expected number of records for all countries in Set 2. The numbers did not match precisely; the totals were out by ~280. When I dug into the cause, I found that some additional records had been added to the CWGC database during my analysis, so my summary data no longer matched the data downloaded exactly. Also, there are graves with no date of death, belonging to German and Russian soldiers, which the date-of-death searches do not pick up. These graves more or less accounted for the difference, so I was satisfied that the numbers reconciled.
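A minimal sketch of that reconciliation is below; the exact discrepancy will depend on when the data were downloaded, since the CWGC database is updated over time:

```r
# Expected total for the three Set 2 countries from the earlier summary
expected_set_2<-sum(cwgc_summary$sum[cwgc_summary$commemorated %in% cwgc_set_2])

# Difference between the summary total and the rows downloaded (~280 in my case)
expected_set_2-nrow(out_file)
```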
Combining the Results
The last step was to combine the data from Set 1 and Set 2.
```r
# Load data from file
load(file="./path/to/file/Set_1_cwgc_graves.rda")
set1<-out_file

# Load data from file
load(file="./path/to/file/Set_2_cwgc_graves.rda")
set2<-out_file

# Row bind
cwgc_graves<-rbind(set1,set2)

# Save output to file
save(cwgc_graves,file="./path/to/file/All_cwgc_graves.rda")
```
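As a final sanity check, a small sketch (assuming cwgc_summary is still in the workspace) to compare the combined data frame against the grand total from the country summary, which should come to roughly 1.7 million records:

```r
# Rows downloaded versus the grand total from the country summary
c(downloaded=nrow(cwgc_graves),expected=sum(cwgc_summary$sum))
```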
Conclusion
This post has demonstrated how to automatically search for and download the roughly 1.7 million grave records from the CWGC website.