Welcome to the course page for the Crowdsourced and social media data for research seminar for the course AG2130 at KTH. This page holds the materials for the lab session, where you will follow a guided example in QGIS and learn to:

Find and download relevant crowdsourced data from Open Street Map using the API query tool Overpass Turbo
Load the data into QGIS
Map the data and draw conclusions about substantive topics
Query the Twitter API to obtain Twitter data using Python

I hope that you enjoy this exercise. The material will remain online for you to access, and Jacob or I will be around in the lab session to answer any of your questions. If you have questions outside of this time, please do not hesitate to get in touch. My email is reka.solymosi@manchester.ac.uk.

Preparation

Before embarking on this, you will need to have done the following:

Download and install QGIS, the open source mapping software we will be using in this session. You can also watch a video introducing you to QGIS, which was part of the Open Data Manchester Pick ‘n’ Mix sessions and is available at this link: https://vimeo.com/417967553.
Create a Twitter Developer account: to participate in the lab session seamlessly, please take time beforehand to create your own developer account with Twitter: https://developer.twitter.com/en/apply-for-access. It is free and doesn’t take long to fill in, but it takes a few days to process, so please do this in advance. Once you’ve done this, don’t forget to create a new application.

The data sources

We will be using crowdsourced data from two sources. First we will look at data from Open Street Map (OSM). OSM is a database of geospatial information built by a community of mappers, enthusiasts and members of the public, who contribute and maintain data about all sorts of environmental features, such as roads, green spaces, restaurants and railway stations, amongst many other things, all over the world. As such, it is a prime example of ‘crowdsourced’ open data. You can view the information contributed to Open Street Map using their online mapping platform https://www.openstreetmap.org/. The result of people’s contributions is a database of spatial information rich in local knowledge which provides invaluable information about places and their features, without being subject to strict terms on usage. If you want to learn more about OSM you can watch this video: https://vimeo.com/417135012.

We will also be using social media data from Twitter. Twitter is a microblogging website where users can create posts known as “Tweets”, in which they can write up to 280 characters of text, attach images, video or links, and then share them with the world. Their followers can like, retweet, or comment on these posts. You can read more about Twitter on their page here: https://help.twitter.com/en/twitter-guide. Tweets have been used in various research projects, for example to study political polarisation, identify locations of train delays, and even predict the stock market, influenza outbreaks, or crime.

Practical exercise

In this exercise you will acquire a number of different skills, including:

Accessing open data using three different methods:
- Direct download.
- Direct calls to an Application Programming Interface (API).
- Calls to an API using a wrapper.
Cleaning, wrangling, and visualizing open spatial data.
Comparing different sources of open data.

Our aim

In this exercise we will explore crime in and around London Underground stations. We know that environmental features are important when it comes to public transport areas being more or less criminogenic. Studies in Sweden, for instance, have shown the importance of environmental and neighborhood characteristics in determining crime concentrations at underground stations in Stockholm, along with the positioning of stations on the line [@ceccato2013security]. Similar work has been carried out exploring bus stops in Los Angeles [@loukaitou1999hot] and the impact of intensive policing along bus corridors in Merseyside, England [@newton2004crime], amongst others.

Here, we will consider the case of London. Specifically, we will examine crime in and around London Underground stations. We will also query Tweets made in London, and explore what we can find out about what people think of police using social media data from Twitter. We will use various sources of open data to answer these questions. This will allow us to explore the strengths and limitations of crowdsourced open data sources.

Accessing data

We will be using three different types of open data in this exercise: public sector data (police recorded crime data and local transport authority data), crowdsourced data (from Open Street Map) and social media data (Twitter). To access them, we will use three different methods, namely (1) direct download, (2) a request to an API through an online interface, and (3) a request to an API using Python. This will give you a few different ideas about how you might go about accessing other open data sets relevant to your research. Locating, identifying and learning to access open data is a skill in itself.

Direct download

The simplest way that open data can be made available is through direct download from a website. In such a case, you can visit a website, select some parameters, and save a file containing the data you requested locally on your computer.

In the United Kingdom, police recorded crime data in England and Wales can be accessed this way using an online web portal (https://data.police.uk/) under Open Government Licence. Visiting this website, you will see a welcome message, and six tabs across the top which should read “Home”, “Data”, “API”, “Changelog”, “Contact”, and “About”.

Before we download any data, we can learn more about it by clicking on the “About” tab. This brings up information that is important to review carefully in order to answer the where, why, who, what and when questions about the data. Take a moment to read through this information, and take notes on what you think might be relevant for your analysis. For example, if we want to map crimes, we will want to explore if there is any type of “anonymization” that might take place before the data are released (to protect the privacy of the victims). If you read the “About” page, you might find the following note:

Location anonymization. The latitude and longitude locations of Crime and ASB incidents published on this site always represent the approximate location of a crime — not the exact place that it happened.

This indicates that although we get a latitude and longitude coordinate with each crime event, it may only be approximate. This may have implications for our findings!

To then download some data, move on to clicking on the “Data” tab. This should open a page entitled “Data downloads” under which you can see another five tabs: “Custom download”, “Archive”, “Boundaries”, “Open data”, and “Statistical data”. By staying on this “Custom download” page you can select what sort of data you want to download. We can select the time period and police force of interest, and the type of information required (e.g. crimes, stop and search, outcomes). We are also informed that the data are downloaded in comma-separated values format (.csv file extension), which meets our machine-readable, easy-to-manipulate data format requirements.

For this exercise, we are going to use British Transport Police data for the month of January 2020. This is a force which operates on railways and light-rail systems across the country. Select the time period using the dropdown menus, and the force using the tickboxes:

Then scroll down and click on the button to generate and download the file:

Save this file locally (in your working directory) to a subfolder named “data”. The download is a compressed file, which you will need to extract. On an Apple Mac, double click on it. On a PC, press and hold (or right-click) the zipped folder, select Extract All, and then follow the instructions.

Once you have done so you will have a folder called 2020-01 with a single .csv file inside.

To load this into QGIS, open QGIS and click on Layer > Add Layer > Add Delimited Text Layer, like below:

Or you can click on the icon shortcut for this on the left side menu:

This will bring up a window, and next to the bar beside “File” click on the ... which stands for “Browse”:

This will open a window that you can use to navigate to your downloaded data csv. Find it and click on “Open”:

This will take you back to the initial window but you will see the bar beside “File” now has text in it, which is the path to your data. You will also see many fields are now populated, including:

“First record has field names” is ticked, so the variables take their names from the column headers
The geometry definition is point coordinates, which applies when we have a column for latitude and one for longitude, as we do in this data set
It specifies that the X field is longitude and the Y field is latitude
It specifies the CRS (coordinate reference system) as WGS84

This is all good, so if yours also looks like this then click on “Add” and then “Close”. You will see the file has appeared in QGIS, and you have mapped all the crimes recorded by the British Transport Police in January 2020:
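If you are curious what this dialog does behind the scenes, here is a minimal Python sketch of the same idea: read the rows, take the X field (longitude) and Y field (latitude), and build point features from them. The sample rows are made up, and the column names are my assumption of how the police.uk CSV is laid out, so check them against your own file.

```python
import csv
import io
import json

# Made-up sample rows; the column names are assumed to follow the
# police.uk street-level CSV layout.
SAMPLE_CSV = """Crime ID,Month,Longitude,Latitude,LSOA name,Crime type
a1,2020-01,-0.1426,51.5392,Camden 021A,Bicycle theft
b2,2020-01,-2.2309,53.4774,Manchester 055B,Robbery
c3,2020-01,,,,Other theft
"""


def rows_to_geojson(csv_text, x_field="Longitude", y_field="Latitude"):
    """Turn CSV rows with coordinate columns into a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rows without coordinates cannot be mapped, so this sketch skips them.
        if not row[x_field] or not row[y_field]:
            continue
        features.append({
            "type": "Feature",
            # GeoJSON stores points as [x, y], that is [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [float(row[x_field]), float(row[y_field])],
            },
            # Every other column becomes an attribute of the point.
            "properties": {k: v for k, v in row.items()
                           if k not in (x_field, y_field)},
        })
    return {"type": "FeatureCollection", "features": features}


collection = rows_to_geojson(SAMPLE_CSV)
print(json.dumps(collection["features"][0]["geometry"]))
```

The one thing to remember is the order: X is longitude and Y is latitude, which is the opposite of how we usually say “lat, long” out loud.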

That is a lot of crime, so let’s focus on a particular area, in this case the Camden local authority in London. To select only the crimes in Camden, right click the layer, and click on “Open Attribute Table”.

This will show you all the data you have with each crime point, for example the kind of crime:

It also has a column called ‘LSOA name’. LSOA is lower super output area, a unit of census geography commonly used in the UK.

To select all crimes that happened in Camden, we will use an expression that will select every row (every crime) where the LSOA name contains “Camden”. To do this, on the top menu select the icon that is a yellow box with a purple equation symbol on it which is the “Select features using an expression” button.

This will pop up a window to type in your expression. The expression needs to name the column (“LSOA name”), use the LIKE operator, and then give Camden with a % on either side. The % signs are wildcards, so the expression will match any characters before or after ‘Camden’.

So the expression looks like this: "LSOA name" LIKE '%Camden%'

You want to take this and write it in the box on the left, which is where the expressions go.

Then click “Select Features” on the bottom right of this box, and then you will see it has selected all crimes which contain Camden in the LSOA name column, in my case 435 crimes.
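For those who like to see the logic spelled out, the selection can be sketched in Python as a simple “does this text appear anywhere in the column?” test. This is only an illustration of what LIKE with a % on both sides means, using made-up rows, and not how QGIS implements it.

```python
import csv
import io

# Made-up rows standing in for the downloaded crime data.
SAMPLE_CSV = """Crime ID,LSOA name,Crime type
a1,Camden 021A,Bicycle theft
b2,Manchester 055B,Robbery
c3,Camden 007C,Other theft
"""


def select_where_contains(rows, column, text):
    """Keep rows whose `column` contains `text` anywhere (case-sensitive)."""
    # `or ""` guards against rows where the column is missing or empty.
    return [row for row in rows if text in (row.get(column) or "")]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
camden = select_where_contains(rows, "LSOA name", "Camden")
print(len(camden), "of", len(rows), "rows selected")
```

Because the wildcard sits on both sides, “Camden 021A” and “Camden 007C” both match, while “Manchester 055B” does not.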

To then export these, click on “Close”, then right click the layer name again, and this time choose “Export” and then “Save Selected Features As…”

This will open a window where you can first choose what format to export. It is up to you what format you usually work with, here I will select GeoJSON because it is common and a small file:

Now you must also select where to save the layer and what to name it. Click again on the “…” (browse) next to the “File name” and a small window will pop up, use this to give your layer a name, and choose the location where to save it (ideally your working directory with your other course files!):

When you are done, click OK, and you will see a new layer in your layers list. If you want to hide the larger layer, just uncheck the box next to it like so:

Great, now we have crimes in Camden, let’s get some crowdsourced data to overlay!

Using an API

Above, we simply saved crime data from the police.uk website. Another way of downloading open data is through an Application Programming Interface (API). This is a tool which defines an interface for a programme to interact with a software component. For example, it defines the sort of requests or calls which can be made, and how these calls and requests can be carried out. Here, we are using the term ‘API’ to denote tools created by an open data provider to give access to different subsets of their content. Such APIs facilitate scripted and programmatic extraction of content, as permitted by the API provider. APIs can take many different forms and be of varying quality and usefulness. For the purposes of accessing open data from the web, we are specifically talking about RESTful APIs. The ‘REST’ stands for Representational State Transfer. These APIs work directly over the web, which means users can play with the API with relative ease in order to understand how it works.
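To make this a little more concrete, here is a minimal sketch of what a call to a RESTful API looks like: a base URL, an endpoint, and some query parameters are combined into one web address, and the answer usually comes back as JSON text. The base URL, endpoint and parameter names below are my assumptions about the police.uk API, so check the “API” tab on https://data.police.uk/ before relying on them. The response shown is made up.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the police.uk API (not confirmed in this tutorial).
BASE_URL = "https://data.police.uk/api"


def build_request_url(endpoint, **params):
    """Combine the base URL, an endpoint and query parameters into one URL."""
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


# Assumed endpoint and parameter names, used here for illustration only.
url = build_request_url("crimes-street/all-crime",
                        lat=51.5392, lng=-0.1426, date="2020-01")
print(url)

# A RESTful API usually answers with JSON text. This response is made up
# to show how such text becomes ordinary Python lists and dictionaries.
fake_response = '[{"category": "bicycle-theft", "month": "2020-01"}]'
crimes = json.loads(fake_response)
print(crimes[0]["category"])
```

Because the request is just a web address, you can paste the printed URL into a browser to see what comes back, which is part of what makes these APIs easy to play with.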

Image from https://www.twilio.com/blog/cool-apis with permission

Some examples

Here are some examples of data sets made available to request from using APIs, and some cool projects people have done with them.

Twitter

Instagram

Police.uk

NASA APIs

Spotify

City Mapper

Why use APIs?

Some data are only accessible through API calls
You get fresh data automatically
Nearly any programming language can be used to access them (e.g. Python, R, Java, JavaScript, Ruby)

Often, developers who work with APIs will share their code and release it in the form of a package or module, so that other people can use it. This is called a wrapper because it uses code that ‘wraps’ around the API to make it a neater, more usable package. Wrappers remove (or at least lower) many of the obstacles to accessing open data. The wrapper can take many forms, such as a Python module, or an R package. It could even be a web interface that provides a graphical user interface (GUI) for accessing the API in question.

To demonstrate this, we will be accessing data from Open Street Map, using its web-based GUI called Overpass Turbo.

Open Street Map API wrapper: Overpass Turbo

As introduced under “The data sources”, Open Street Map is a crowdsourced database of geospatial information, which you can browse on its online mapping platform (https://www.openstreetmap.org/).

Open Street Map (OSM) is currently on API version 0.6, originally deployed 17-21 April 2009. The API is currently accessible using the following URL: https://api.openstreetmap.org/. We can query OSM data without authentication, that is, without creating any sort of login. However, all of the calls to the API which update, create or delete data have to be made by an authenticated and authorized user.

To read more about the details of the OSM API see the documentation.

Open Street Map has two types of wrappers available for its API, a web-based GUI called Overpass Turbo (https://overpass-turbo.eu/), and an R package called osmdata. We start with Overpass Turbo.

Overpass Turbo

When you open the link it will give you an example:

/*
This is an example Overpass query.
Try it out by pressing the Run button above!
You can find more examples with the Load tool.
*/
  
node
  [amenity=drinking_water]
  ({{bbox}});
out;

Overpass QL source code is divided into statements, and every statement ends with a semicolon ;. Statements are executed sequentially. Your query can contain any combination or number of OpenStreetMap elements (nodes, ways, and relations).

When you structure your query like the example above (for instance node[name="Foo"];), the result is written into a default set, so the default set always holds the result of your most recent query. If you want to save your result to a specific set, use the -> syntax followed by the set name prefixed with a full stop (.).

For example:

node[name="Foo"]->._;

The other thing specified in the example query above is the area in which to search for drinking water amenities. The shortcut {{bbox}} takes the bounding box of the map currently shown in the browser window as the boundary within which to perform the search. Another shortcut you might use is {{center}}, which returns the center coordinates of the current view. If we wanted to specify the bounding box with coordinates instead, we can do so by giving its four edges in this order:

(south,west,north,east)

How can we find these coordinates? There are a few ways. One is to use a GUI-based tool such as http://bboxfinder.com/. We can use this to navigate to Manchester, for example, and come up with the following:

(53.368643,-2.534510,53.587675,-1.895244)

So our query looks like:

/*
This is an example Overpass query.
Try it out by pressing the Run button above!
You can find more examples with the Load tool.
*/
  
node
  [amenity=drinking_water](53.368643,-2.534510,53.587675,-1.895244);
out;
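If you ever want to put such a query together in a script rather than by hand, the sketch below shows one way to do it in Python. It writes the bounding box filter as four numbers in parentheses in south, west, north, east order, which as far as I know is the plain Overpass QL form that {{bbox}} expands to. The server address is an assumed public Overpass endpoint, and the function names are made up for this example.

```python
from urllib.parse import urlencode

# Assumed public Overpass API endpoint; Overpass Turbo talks to a server
# like this one on your behalf.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def bbox_filter(south, west, north, east):
    """Format a bounding box in south, west, north, east order."""
    if south > north:
        raise ValueError("south must not exceed north")
    return f"({south},{west},{north},{east})"


def build_query(key, value, bbox):
    """Build a query for nodes tagged key=value inside the bounding box."""
    return f"node[{key}={value}]{bbox_filter(*bbox)};out;"


manchester = (53.368643, -2.534510, 53.587675, -1.895244)
query = build_query("amenity", "drinking_water", manchester)
print(query)

# The query travels to the server as the `data` parameter of the URL.
print(f"{OVERPASS_URL}?{urlencode({'data': query})}")
```

The small check in bbox_filter catches the most common mistake, which is passing the coordinates in longitude-first order.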

This seems like a bit of work though, and instead, we can use the search bar on the map to find a location we’re interested in. Here let’s move to Manchester, and then run the query again so you can see drinking water availability in Manchester.

Exercise

Now that pubs are open, why keep drinking water? Change the query to show you all the pubs in Manchester.

Get Stations

Great, let’s go back to our stops along the Northern Line in London then. Use the search bar to find London, get our bounding box from our map view again using the {{bbox}} shortcut, and search for “station” instead of drinking_water. Well, this returns nothing. Overpass Turbo even tells us “map left intentionally blank”.

This is because there are no nodes that are amenities labelled station. So what can we do?

Features in Open Street Map are defined through ‘keys’ and ‘values’. Keys are used to describe a broad category of features (e.g. highway, amenity), and values are more specific descriptions (e.g. cycleway, bar). These are tags which contributors to Open Street Map have defined. A useful way to explore these is by using the comprehensive Open Street Map Wiki page on map features. In this case there are no key-value pairs that match amenity-station. So what can we do to find the correct key-value pair?

To help build our queries, we can use the query Wizard. This is really helpful, but make sure you follow the documentation when structuring your queries. Click on the ‘Wizard’ option and enter station in the textbox:

Now we finally have our stations:

We can play around with this, for example select only the nodes, and when we are happy, we can save the result using the ‘Export’ button. Here you can choose format, again it is up to you what you like to work with, I will choose GeoJSON again:

Save this in your working directory.

To import this into QGIS, you load it much like the police.uk data, except that this time the file is not delimited text (a CSV) but a GeoJSON (if you downloaded this format like me), which is vector data. So in this case you will choose “Add Vector Layer”, either from the menu:

Or with the shortcut:

Then it is very similar to the steps for adding the crime data, except we find “…” (browse) next to the “Vector Dataset(s)” tab. Click on this, navigate to your GeoJSON file (I named mine “osm_stations.geojson”), and click “Add” and then “Close”.

A new layer called “osm_stations” should appear:

This is a lot of stations, let’s only select those on the Northern Line, which goes through Camden. Like with the crimes layer, right click and ‘open attribute table’ to see what information we have. You see we have a column called ‘line’. We can use this the same way we did with the ‘LSOA name’ column earlier to find only stops on the Northern line.

Click on the icon that is a yellow box with a purple equation symbol on it which is the “Select features using an expression” button.

This time, we want the column to be “line” and the characters we are matching to be “Northern” so our new expression is:

"line" LIKE '%Northern%'

Click “Select features” and then “Close”. You will see how many you got; for me there are 50 features selected.

Save them as a new layer the same way we did with the Camden crimes (right click, Export, Save selected features as…). Again I’m using GeoJSON and I’m naming it northern_line:
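The same select-and-save step can be sketched in Python, which is handy to know because a GeoJSON file is just text that the built-in json module can read. The features below are a made-up stand-in for the exported stations file, and the function name is hypothetical.

```python
import json

# Made-up stand-in for the exported osm_stations.geojson file.
SAMPLE_GEOJSON = """{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-0.1426, 51.5392]},
     "properties": {"name": "Station A", "line": "Northern"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-0.1340, 51.5250]},
     "properties": {"name": "Station B", "line": "Northern;Victoria"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-0.1000, 51.5000]},
     "properties": {"name": "Station C"}}
  ]
}"""


def filter_features(collection, prop, text):
    """Return a new FeatureCollection keeping features whose `prop` contains `text`."""
    kept = [feature for feature in collection["features"]
            # Features without the property (like Station C) are dropped.
            if text in (feature["properties"].get(prop) or "")]
    return {"type": "FeatureCollection", "features": kept}


stations = json.loads(SAMPLE_GEOJSON)
northern_line = filter_features(stations, "line", "Northern")
print([f["properties"]["name"] for f in northern_line["features"]])
```

To save the result you could write it out with json.dump(northern_line, open_file), where open_file is a file you opened for writing, for example one named northern_line.geojson.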

Now, if you untick the osm_stations layer and zoom right in, you will see your Northern line stations (black here) and your Camden crimes (pink here). You can see that crimes recorded by the British Transport Police in Camden seem to tie in with these stations.
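If you wanted to go one step beyond eyeballing the map, one simple way to put a number on “seem to tie in” is to calculate what share of crimes fall within some distance of a station. The sketch below does this with the haversine formula for great-circle distance. The coordinates are made up, and the 200 metre radius is an arbitrary choice for illustration, not a recommendation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two WGS84 points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def share_near_stations(crimes, stations, radius_m=200):
    """Fraction of crime points within radius_m of at least one station."""
    if not crimes:
        return 0.0
    near = sum(
        any(haversine_m(*crime, *station) <= radius_m for station in stations)
        for crime in crimes
    )
    return near / len(crimes)


# Made-up (longitude, latitude) pairs for illustration only.
stations = [(-0.1426, 51.5392)]
crimes = [(-0.1427, 51.5393), (-0.1430, 51.5390), (-0.1200, 51.5500)]
print(share_near_stations(crimes, stations))
```

Keep the location anonymization in mind when reading such a number: the crime coordinates are approximate, so small distances should not be over-interpreted.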

Getting Twitter data using Python

To get Twitter data, we are going to step outside of the QGIS environment, and move instead to some coding in Python. Don’t worry, it will be nothing too dangerous: I have created a Python notebook on Google Colab, which you can play around with to follow along.

So first things first, you need to have your Twitter developer credentials ready.

IMPORTANT: you will need to have obtained your Twitter Developer Account credentials in order to be able to follow along with the rest of this tutorial!

To find your credentials, log into your developer account https://developer.twitter.com/, and find your Developer Dashboard: https://developer.twitter.com/en/portal/dashboard. You can see already created apps here. For example, I have one called “Social media in policing and safeguarding”:

If you don’t have any yet, you should click on ‘Add App’. Here choose to ‘Create a new App’ option. Then give your app a name you like. Be creative, because many app names are already taken! Once you have given your app a name, you will be taken to the Keys & Tokens page. Here you will see your API Key and your API Key Secret:

NOTE: COPY THESE DOWN NOW, as you won’t be able to access them again. You need them to search Tweets from Python later, so make sure you copy them and store them somewhere safe, like you would a password.

Once you’ve noted these down, go to your app. You can always access this app later by clicking on the projects & apps tab and selecting the app name. Mine is (uncreatively) named rekastestapp:

Here you can change the settings of your app, like permissions, but let’s not worry about that for now. For now, click on “Keys and tokens” at the top, under your app name.

When you do this you will see your consumer keys. These are the API Key and the API Key Secret, which you noted down earlier when you were creating the app. But we also need the Access Token and Secret. To generate these, click on the button next to it which says ‘generate’:

These values will pop up. Once again MAKE SURE TO SAVE THESE as you will not be shown them again. Once you have these saved, we can go to our Colab notebook.
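Since these four values work like passwords, it is good practice not to type them straight into code that you might share. One common approach, sketched below, is to keep them in environment variables and read them in when needed. The variable names are hypothetical, and this is not necessarily how the course notebook handles credentials.

```python
import os

# Hypothetical variable names; use whatever names you like, as long as
# the same names are set in your environment.
REQUIRED = ("TWITTER_API_KEY", "TWITTER_API_KEY_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET")


def load_credentials(env=os.environ):
    """Read the four credentials, failing early if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


def mask(secret, visible=4):
    """Hide all but the last few characters so a secret can be printed safely."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]


# Demonstration with placeholder values instead of real credentials.
demo_env = {name: "example-value-1234" for name in REQUIRED}
credentials = load_credentials(demo_env)
print({name: mask(value) for name, value in credentials.items()})
```

Failing early with a clear message saves you from a confusing authentication error later, and masking means you can check which key you loaded without showing it on screen.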

Open this link in your browser (ideally Firefox or Google Chrome): https://colab.research.google.com/drive/1N3Ifc_e0Ca-6F4bKlzG6tj2sFT7dLK3V?usp=sharing

When you are there you will see some code. Please DO NOT EDIT this document. Instead, you will need to create a copy.

To do this, click on File > Save a copy in Drive:

You will need a Google account for this, so log into your Google account to continue. If you do not have a Google account, you can make one quickly, or you can work with a friend.

Once you have copied this into your google account, you will be able to make changes. Follow the instructions on the notebook, and you will be rewarded with some Tweets!

Besides the examples in the notebook, you can explore Twitter data in other ways. You might like to explore the tutorials available from Twitter, to give you some inspiration for future work: https://developer.twitter.com/en/docs/twitter-api/tutorials.

Now have a look at the .csv file you downloaded. You can open this with Excel. Mine looks like this:

You can see the place column contains all sorts of different place resolutions, from London, UK to more specific ones like Haringey, North London, and so on. If we wanted to look at a frequency table, we can create a pivot table in Excel to do this:

First, click on the “Insert” tab, and then click on “Pivot Table”:

When the popup window appears, click “OK”:

Then, when the pivot table appears drag the ‘place’ variable to rows, and again to the values boxes:

This will create a frequency table of the places which the Tweets are linked to. You can see that they are mostly vague, the three most common being “London, England”, “London”, and “London, UK” (although I also get one that is “the globe”).
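If you would rather stay in Python than open Excel, the same frequency table takes only a few lines. This is a minimal sketch with made-up rows. It assumes your file has a column called place, as mine does.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for the downloaded Tweets file.
SAMPLE_CSV = """text,place
"tweet one","London, England"
"tweet two","London"
"tweet three","London, England"
"tweet four","Haringey, London"
"""


def place_frequencies(csv_text, column="place"):
    """Count how often each value of `column` appears."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in rows)


# most_common() sorts from the most to the least frequent place.
for place, count in place_frequencies(SAMPLE_CSV).most_common():
    print(f"{count:>3}  {place}")
```

Note that the quotes around each value matter here, because place names such as “London, England” contain a comma themselves.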

Mapping at this level would not add too much here. But there is other valuable data in here. We can read through the tweets and see what people are talking about when they tweet about police.

Some are reporting events they observe. For example:

Damn police already on campus

Some are reporting news stories:

Metropolitan Police officer cleared of rape https://t.co/Vu8N7CiL26 Get NewsPlayer+ #breakingnews #livestreaming #newsplayerplus

And others may be more difficult to categorise…

D woman was wrong learn to mind your bizness around init kids if police cones the woman sef don get yawa b dat o

In any case, there is plenty of rich data there to start thinking about using in research projects!

Conclusion

To conclude, this tutorial aimed to introduce you to open data, specifically crowdsourced and social media data. We acquired data using direct download, API query through a graphical user interface on the web, and API query through some Python programming. I hope this has been a fun exercise. If you have any questions, don’t hesitate to get in touch and ask!

Crowdsourced and social media data for research

Reka Solymosi

10/11/2021