A setup to semi-automate a check for third-party scripts
Open Web Privacy Measurement with OpenWPM
In May 2018, when the GDPR panic was at its peak, I needed a setup to check websites for GDPR-relevant third-party scripts. Since I administer more than 30 websites, I needed to automate this task somehow.
The more technical parts of GDPR all relate to third-party scripts and cookies.
Even for sites I developed myself, I could not guarantee that editors had not pasted embed codes into the WordPress editor. Embedding a YouTube video, for example, loads scripts, styles and images from YouTube's servers into the visitor's browser. The visitor's IP address is therefore shared with YouTube, which is, strictly speaking, already a violation of GDPR, since you (the website owner) are transferring personal data (the visitor's IP address) to a third party without that person's consent.
So, to be safe, you should check which third-party services are used on each page.
The setup is the following:
- Extract the URLs from a website: not just the homepage, but at least every linked page. (There are many scripts for this.)
- Take these URLs and run them through a tool called OpenWPM. OpenWPM drives a real browser and collects all HTTP traffic into a database: not just "static" HTTP traffic, but also everything coming from JavaScript calls, the cookies set by JavaScript, and so on.
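The first step (collecting every page linked from a starting page) can be sketched with the Python standard library alone. This is only an illustration, not the script I used: LinkCollector and internal_links are hypothetical names, and the sketch looks only at plain <a href> links on one page, with an exact host comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, base_url):
    """Return the sorted, de-duplicated same-host URLs linked from html."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == base_host:
            found.add(absolute)
    return sorted(found)
```

The HTML itself could be fetched with urllib.request.urlopen, and the returned list used as the list of sites to crawl.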
OpenWPM produces 14 tables that you can inspect and cross-check.

The 14 tables OpenWPM produces for inspection
The following OpenWPM script runs its test on the big German newspaper “spiegel.de”.
Python Script for OpenWPM:
from __future__ import absolute_import
from automation import TaskManager, CommandSequence
from six.moves import range

# The list of sites that we wish to crawl
NUM_BROWSERS = 1
sites = ["http://www.spiegel.de/"]

# Loads the manager preferences and NUM_BROWSERS copies of the default browser dictionaries
manager_params, browser_params = TaskManager.load_default_params(NUM_BROWSERS)

# Update browser configuration (use this for per-browser settings)
for i in range(NUM_BROWSERS):
    # Record HTTP requests and responses
    browser_params[i]['http_instrument'] = True
    # Keep Flash disabled
    browser_params[i]['disable_flash'] = True
    # Record both JS cookies and HTTP response cookies to javascript_cookies
    browser_params[i]['cookie_instrument'] = True
    # Record JavaScript calls
    browser_params[i]['js_instrument'] = True

browser_params[0]['save_javascript'] = True
browser_params[0]['headless'] = True  # Launch browser 0 headless

# Update TaskManager configuration (use this for crawl-wide settings)
manager_params['data_directory'] = '~/Desktop/openWPM/spiegel/'
manager_params['log_directory'] = '~/Desktop/openWPM/spiegel/'

# Instantiates the measurement platform
# Commands time out by default after 60 seconds
manager = TaskManager.TaskManager(manager_params, browser_params)

# Visits the sites with all browsers simultaneously
for site in sites:
    command_sequence = CommandSequence.CommandSequence(site, reset=False)

    # Start by visiting the page
    command_sequence.get(sleep=0, timeout=60)
    # command_sequence.browse(num_links=30, sleep=0, timeout=60)
    # command_sequence.extract_links(30)

    # dump_profile_cookies/dump_flash_cookies closes the current tab.
    command_sequence.dump_profile_cookies(120)
    # command_sequence.screenshot_full_page()

    # index='**' synchronizes visits between all browsers
    manager.execute_command_sequence(command_sequence, index='**')

# Shuts down the browsers and waits for the data to finish logging
manager.close()
This will produce a SQLite database in the directory you specified with data_directory, containing a lot of interesting data.
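For a first overview of what a crawl recorded, the tables and their row counts can be listed with Python's built-in sqlite3 module. This is a minimal sketch: table_row_counts is a hypothetical helper, not part of OpenWPM, and it works on any SQLite file.

```python
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Names come from sqlite_master itself; quote them as identifiers.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                "SELECT COUNT(*) FROM " + quoted
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Call it with the full path to the database file inside your data_directory. I am assuming here that the file is named crawl-data.sqlite; check the directory for the actual name. Empty tables usually mean that the corresponding instrument was not switched on in browser_params.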
Some simple queries can give you a list of the hosts that set cookies in your browser:
select distinct host from javascript_cookies;
I got 14 distinct hosts setting JavaScript cookies (Sept. 2018):
- “www.spiegel.de”
- “.ioam.de”
- “.spiegel.de”
- “c.spiegel.de”
- “.doubleclick.net”
- “.yieldlab.net”
- “.xplosion.de”
- “ups.xplosion.de”
- “.config.parsely.com”
- “.theadex.com”
- “.adfarm1.adition.com”
- “ad13.adfarm1.adition.com”
- “.twiago.com”
- “.adsrvr.org”
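When the same check has to run for many sites, it is handy to run the query from Python and also see how many rows each host produced. The sketch below assumes only what the query above already uses (a javascript_cookies table with a host column); cookie_rows_per_host is a hypothetical helper name.

```python
import sqlite3


def cookie_rows_per_host(db_path):
    """Return (host, row_count) pairs from javascript_cookies, busiest host first."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT host, COUNT(*) AS n FROM javascript_cookies "
            "GROUP BY host ORDER BY n DESC, host"
        ).fetchall()
    finally:
        conn.close()
```

Note that the count is the number of recorded rows per host, which is not necessarily the number of distinct cookies.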
Let's do the same on another interesting table:
select distinct script_url from javascript;
And this is the result (18 distinct script_urls):
- “http://www.spiegel.de/layout/js/http/javascript-V8-56.js”
- “https://script.ioam.de/iam.js”
- “http://www.spiegel.de/layout/js/http/messaging-V8-56.js”
- “https://www.googletagmanager.com/gtm.js?id=GTM-WJQWWTD”
- “http://www.spiegel.de/”
- “http://www.spiegel.de/layout/js/http/netmind-V8-56.js”
- “http://s290.mxcdn.net/bb-mx/serve/mtrcs_897887.js”
- “http://www.spiegel.de/staticgen/data_imports/emstm/spiegel-www/live.js”
- “https://www.google-analytics.com/analytics.js”
- “https://script.hotjar.com/modules-1fba13cbb2ccc31138fe484993444853.js”
- “http://static.emsservice.de/autoNative/project/autoNative.min.js”
- “https://www.googletagservices.com/tag/js/gpt.js?0.2927051360420776”
- “https://static.criteo.net/js/ld/publishertag.standalone.js”
- “https://s79.mxcdn.net/bb-mx/serve/mtrcs_799752.js”
- “https://securepubads.g.doubleclick.net/gpt/pubads_impl_263.js”
- “http://cdn.emetriq.de/adp/profiling/0.1.13/p.min.js”
- “https://optout.adalliance.io/status/”
- “https://tpc.googlesyndication.com/pagead/js/r20181003/r20110914/activeview/osd_listener.js”
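To answer the actual GDPR question (which of these scripts come from third parties), the URLs can be grouped by host and everything belonging to the site itself dropped. This is a rough sketch: third_party_scripts is a hypothetical helper, and the first-party test is a naive suffix match on the domain rather than a proper lookup against the Public Suffix List, so treat the output as a list of candidates to review.

```python
from collections import defaultdict
from urllib.parse import urlparse


def third_party_scripts(script_urls, first_party_domain):
    """Group script URLs by host, skipping hosts that belong to the first party."""
    grouped = defaultdict(list)
    for url in script_urls:
        host = urlparse(url).hostname or ""
        # Naive match: the domain itself and all of its subdomains are first party.
        if host == first_party_domain or host.endswith("." + first_party_domain):
            continue
        grouped[host].append(url)
    return dict(grouped)
```

Called with the script_url values from the query above and "spiegel.de" as the first-party domain, it returns a dictionary keyed by hosts such as script.ioam.de or www.google-analytics.com.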
I hope the GDPR will show more teeth (Denglish) soon.
Maybe OpenWPM could help with some reliable and large-scale privacy data.