A setup to semi-automate a check for third-party scripts

Open Web Privacy Measurement with OpenWPM

In May 2018, when GDPR panic was at its peak, I needed a setup to check websites for GDPR-relevant third-party scripts. Since I administer more than 30 websites, I needed to automate this task somehow.
For a website owner, the more technical parts of GDPR mostly come down to third-party scripts and cookies.
Even for sites I developed myself, I could not guarantee that editors had not pasted embed codes into the WordPress editor. Embedding a YouTube video, for example, loads scripts, styles and images from YouTube's servers into the visitor's browser. The visitor's IP address is therefore shared with YouTube, which, strictly speaking, is already a violation of GDPR, since you (the website owner) are transferring personal data (the IP address of your visitor) to a third party without that person's consent.

So to be safe, you should check which third-party services are used on each page.

The setup is as follows:

  1. Extract the URLs from a website: not just the homepage, but at least every linked page. (There are many scripts for this.)
  2. Take these URLs and run them through a tool called OpenWPM. OpenWPM drives a real browser (Firefox) and collects all HTTP traffic into a database: not just “static” HTTP traffic, but also everything coming from JavaScript calls, the cookies set by JavaScript and so on.
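
Step 1 can be sketched with the Python standard library alone. This is a minimal sketch, not the script I used: the names LinkCollector and internal_links are made up for illustration, and it only looks at plain <a href> links in the HTML it is given, so links created by JavaScript are missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, page_url):
    """Return the sorted, de-duplicated links in html that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop the "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.add(absolute)
    return sorted(links)
```

The HTML itself could be fetched with urllib.request.urlopen(page_url).read().decode(), and the resulting list used as the list of sites to crawl.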

OpenWPM produces 14 tables that you can inspect and cross-check.

The 14 tables OpenWPM produces for inspection

 

The following OpenWPM script runs its test on the big German newspaper “spiegel.de”.

 

Python Script for OpenWPM:


from __future__ import absolute_import
from automation import TaskManager, CommandSequence
from six.moves import range

# Number of browser instances to launch
NUM_BROWSERS = 1

# The list of sites that we wish to crawl
sites = ["http://www.spiegel.de/"]

# Loads the manager preferences and NUM_BROWSERS copies of the default browser dictionaries
manager_params, browser_params = TaskManager.load_default_params(NUM_BROWSERS)

# Update browser configuration (use this for per-browser settings)
for i in range(NUM_BROWSERS):
    # Disable Flash in every browser
    browser_params[i]['disable_flash'] = True
    # Record HTTP requests and responses
    browser_params[i]['http_instrument'] = True
    # Record both JS cookies and HTTP response cookies to javascript_cookies
    browser_params[i]['cookie_instrument'] = True
    # Record JavaScript calls
    browser_params[i]['js_instrument'] = True

# Also save the JavaScript files encountered by browser 0
browser_params[0]['save_javascript'] = True
# Launch browser 0 headless
browser_params[0]['headless'] = True

# Update TaskManager configuration (use this for crawl-wide settings)
manager_params['data_directory'] = '~/Desktop/openWPM/spiegel/'
manager_params['log_directory'] = '~/Desktop/openWPM/spiegel/'

# Instantiates the measurement platform
# Commands time out by default after 60 seconds
manager = TaskManager.TaskManager(manager_params, browser_params)

# Visits the sites with all browsers simultaneously
for site in sites:
    command_sequence = CommandSequence.CommandSequence(site, reset=False)

    # Start by visiting the page
    command_sequence.get(sleep=0, timeout=60)
    # command_sequence.browse(num_links=30,sleep=0, timeout=60)

    # command_sequence.extract_links(30)
    # dump_profile_cookies/dump_flash_cookies closes the current tab.
    command_sequence.dump_profile_cookies(120)
    # command_sequence.screenshot_full_page()

    # index='**' synchronizes visits between all browsers
    manager.execute_command_sequence(command_sequence, index='**')

# Shuts down the browsers and waits for the data to finish logging
manager.close()
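
To crawl more than one page, the hard-coded sites list can be filled from the URL list of step 1. A minimal sketch, assuming the URLs are stored one per line in a text file; the helper name load_sites and the file name urls.txt are hypothetical and not part of OpenWPM.

```python
def load_sites(lines):
    """Return the unique, non-empty URLs from an iterable of text lines, in order."""
    seen = set()
    sites = []
    for line in lines:
        url = line.strip()
        # Skip blank lines, "#" comment lines and URLs already seen.
        if not url or url.startswith("#") or url in seen:
            continue
        seen.add(url)
        sites.append(url)
    return sites
```

In the crawl script this would replace the literal list, for example: with open("urls.txt") as handle: sites = load_sites(handle).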

 

 

This produces a SQLite database, containing a lot of interesting data, in the location you specified with data_directory.
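
A quick way to get an overview of that database is to list its tables with their row counts. The following is a minimal sketch using Python's built-in sqlite3 module; the function name table_row_counts is made up for illustration, and the path to the database file is left to the caller, since the file name depends on the OpenWPM configuration.

```python
import sqlite3


def table_row_counts(connection):
    """Return {table_name: row_count} for every table in the database."""
    names = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    counts = {}
    for name in names:
        # Table names cannot be bound as query parameters, so quote them instead.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = connection.execute(
            "SELECT COUNT(*) FROM " + quoted).fetchone()[0]
    return counts
```

Usage would be along the lines of table_row_counts(sqlite3.connect(path_to_database)), where path_to_database points at the file inside data_directory.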

Some simple queries can give you a list of the hosts that set cookies in your browser:

select distinct host from javascript_cookies;

 

I got 14 distinct hosts in the javascript_cookies table (Sept. 2018):

“www.spiegel.de” “.ioam.de” “.spiegel.de” “c.spiegel.de” “.doubleclick.net” “.yieldlab.net” “.xplosion.de” “ups.xplosion.de” “.config.parsely.com” “.theadex.com” “.adfarm1.adition.com” “ad13.adfarm1.adition.com” “.twiago.com” “.adsrvr.org”
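
For a GDPR check, the interesting hosts are the ones that do not belong to the site itself. A minimal sketch of that filter, built on the same query: the function name third_party_cookie_hosts is made up for illustration, and matching on the domain suffix is a simplifying assumption (it does not know which domains belong to the same company).

```python
import sqlite3


def third_party_cookie_hosts(connection, first_party_domain):
    """Return the sorted cookie hosts that are not under first_party_domain."""
    rows = connection.execute("SELECT DISTINCT host FROM javascript_cookies")
    third_party = set()
    for (host,) in rows:
        # Cookie hosts may carry a leading dot (".spiegel.de"); strip it first.
        bare = (host or "").lstrip(".").lower()
        if not bare:
            continue
        if bare == first_party_domain or bare.endswith("." + first_party_domain):
            continue
        third_party.add(bare)
    return sorted(third_party)
```

Called with "spiegel.de" as first_party_domain, this drops the site's own hosts and keeps the advertising and tracking hosts.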

 

Let's do the same on another interesting table:

select distinct script_url from javascript;

 

And this is the result (18 distinct script_urls):

  1. “http://www.spiegel.de/layout/js/http/javascript-V8-56.js”
  2. “https://script.ioam.de/iam.js”
  3. “http://www.spiegel.de/layout/js/http/messaging-V8-56.js”
  4. “https://www.googletagmanager.com/gtm.js?id=GTM-WJQWWTD”
  5. “http://www.spiegel.de/”
  6. “http://www.spiegel.de/layout/js/http/netmind-V8-56.js”
  7. “http://s290.mxcdn.net/bb-mx/serve/mtrcs_897887.js”
  8. “http://www.spiegel.de/staticgen/data_imports/emstm/spiegel-www/live.js”
  9. “https://www.google-analytics.com/analytics.js”
  10. “https://script.hotjar.com/modules-1fba13cbb2ccc31138fe484993444853.js”
  11. “http://static.emsservice.de/autoNative/project/autoNative.min.js”
  12. “https://www.googletagservices.com/tag/js/gpt.js?0.2927051360420776”
  13. “https://static.criteo.net/js/ld/publishertag.standalone.js”
  14. “https://s79.mxcdn.net/bb-mx/serve/mtrcs_799752.js”
  15. “https://securepubads.g.doubleclick.net/gpt/pubads_impl_263.js”
  16. “http://cdn.emetriq.de/adp/profiling/0.1.13/p.min.js”
  17. “https://optout.adalliance.io/status/”
  18. “https://tpc.googlesyndication.com/pagead/js/r20181003/r20110914/activeview/osd_listener.js”
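
A list like this is easier to judge when it is grouped by host and the site's own scripts are left out. A minimal sketch, assuming the URLs come from the script_url query above; the function name third_party_script_hosts is made up for illustration, and the suffix match on the domain is again a simplification.

```python
from collections import Counter
from urllib.parse import urlparse


def third_party_script_hosts(script_urls, first_party_domain):
    """Count scripts per host, skipping hosts under first_party_domain."""
    counts = Counter()
    for url in script_urls:
        host = (urlparse(url).hostname or "").lower()
        if not host:
            continue
        if host == first_party_domain or host.endswith("." + first_party_domain):
            continue
        counts[host] += 1
    return dict(counts)
```

The input can be built with [row[0] for row in connection.execute("select distinct script_url from javascript")], using a sqlite3 connection to the crawl database.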

 

I hope the GDPR will show more teeth (Denglish) soon.
Maybe OpenWPM can help with some reliable, large-scale privacy data.