About this service

Why did we create the Consent Observatory?

We created the Consent Observatory to make it easier for more people to study consent interfaces on the web.

Consent pop-ups have become ubiquitous on the web as companies whose business models rely on personal data processing try to grapple with data protection regulation across the world. Researchers, regulators, and policy makers have become interested in understanding whether those interfaces actually comply with the regulation, how they evolve over time, what the geographic differences are, and so on. However, studying consent banners at scale requires extensive computational infrastructure and domain expertise, and this complexity is currently a barrier for many people. By making our data gathering tools public, we hope to make it easier for more people to study this topic and to prevent unnecessary and time-consuming duplication of software.

Methodology

The data is collected using a scraper with custom detection scripts.

Scraper

The scraper is open source and can be found here.

The scraper is built using a combination of Puppeteer (a browser automation library maintained by Google) and custom Node.js code. The scraper is run with the following specifications:

  • Headless
  • Concurrency: 15 URLs (to balance speed with anti-scraper protections)
  • Time-out: 90 seconds without HTTP response
  • Puppeteer v. 18
  • Chromium v. 107.0.5296.0
  • Server: located in Denmark and managed by Aarhus University
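The specifications above could translate into a setup roughly like the following sketch. This is illustrative, not the Observatory's actual code: the function names (scrapeUrl, main) and the worker-pool approach to concurrency are assumptions.

```javascript
const puppeteer = require('puppeteer');

const CONCURRENCY = 15;    // parallel URLs, balancing speed and blocking risk
const TIMEOUT_MS = 90_000; // give up after 90 seconds without an HTTP response

async function scrapeUrl(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url, { timeout: TIMEOUT_MS, waitUntil: 'domcontentloaded' });
    // ...run detection scripts in the page context here...
  } finally {
    await page.close();
  }
}

async function main(urls) {
  const browser = await puppeteer.launch({ headless: true });
  // Simple worker pool: each worker pulls the next URL until none remain.
  const queue = [...urls];
  const workers = Array.from({ length: CONCURRENCY }, async () => {
    while (queue.length) await scrapeUrl(browser, queue.shift());
  });
  await Promise.all(workers);
  await browser.close();
}
```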

Detection scripts

The analysis of each consent pop-up is done on the basis of a custom-built script. You can read about it in more detail in our paper: A Cross-Country Analysis of GDPR Cookie Banners and Flexible Methods for Scraping Them.

  • Consent Interface: looks for all elements with a z-index > 10 and position='fixed' that contain one of the words in the multilingual word corpus.
  • CMP name: looks inside the detected consent interface for elements that match a regex pattern of known CMP selectors.
  • IAB CMP info: queries the IAB TCF __tcfapi function to get the CMP's ID.
  • User options: looks at the inner text and attribute values of elements inside detected consent interfaces and, after normalising, checks whether they match a word corpus within a Levenshtein distance of 1.
  • Toggles: looks for all elements with input[type='checkbox'] or [role='checkbox'] and returns their checked and disabled status.
  • DOM: returns a copy of the html element 10 seconds after the DOM has fully loaded.
  • Cookies placed: returns all first- and third-party cookies present 10 seconds after the DOM has fully loaded.
  • Elements with event listeners: looks for all visible elements inside a detected consent interface that contain some text and have event listeners attached for the click event.
  • Button elements: finds all visible elements inside a detected consent interface that match a button selector.
  • Visibility analyser: scores the visibility of an element based on its size, the contrast between its colour and the background, its border, font styling, and text decoration.
  • Click listener analyser: scores how interactive an element is by evaluating if the element has any click event listeners, how many, and how far up the DOM tree they are.
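As a rough illustration of the consent interface detection step, the position and z-index check combined with corpus matching might look like the sketch below. The element objects and the three-word corpus are hypothetical stand-ins for real DOM elements (read via getComputedStyle in the page) and the full multilingual corpus.

```javascript
// Tiny sample corpus; the real multilingual word corpus is much larger.
const CONSENT_WORDS = ['cookie', 'consent', 'privacy'];

// A candidate element "looks like" a consent interface when it is fixed,
// stacked above most content (z-index > 10), and contains a corpus word.
function looksLikeConsentInterface(el) {
  return (
    el.position === 'fixed' &&
    Number(el.zIndex) > 10 &&
    CONSENT_WORDS.some(w => el.innerText.toLowerCase().includes(w))
  );
}

// Plain objects standing in for elements and their computed styles.
const candidates = [
  { position: 'fixed', zIndex: '2147483647', innerText: 'We use cookies…' },
  { position: 'static', zIndex: 'auto', innerText: 'Latest news' },
];

const detected = candidates.filter(looksLikeConsentInterface);
```

Here only the first candidate passes: the second is not fixed, and `Number('auto')` is NaN, so the z-index test fails for it.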
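The user-option matching described above (normalising text, then comparing it against a word corpus within a Levenshtein distance of 1) can be sketched as follows. The corpus entries and the matchesCorpus helper are illustrative, not the actual implementation.

```javascript
// Classic dynamic-programming Levenshtein edit distance.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Hypothetical mini corpus; the real multilingual corpus is far larger.
const corpus = ['accept all', 'accepter alle', 'alle akzeptieren'];

// Normalise the text (case, whitespace) and allow one edit of slack,
// so small typos or stray characters still match.
function matchesCorpus(text) {
  const normalised = text.trim().toLowerCase().replace(/\s+/g, ' ');
  return corpus.some(word => levenshtein(normalised, word) <= 1);
}
```

For example, 'Accept All' matches exactly after normalising, and 'accept al' still matches because its distance to 'accept all' is 1.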

Limitations

There are a number of limitations that affect the reliability and validity of our analysis, which should be considered when using this service for research.

  • We can only detect what we know. Because some of our detection scripts are based on text or selector matching, there might be some interfaces and features that we cannot detect. However, our accuracy rates are very high (see p.6 of the paper).
  • Updates can break our scripts. Because of our approach described above, it is possible that a previously working script breaks when a pop-up provider changes the part of their code that we use to detect or interact with.
  • Anti-scraper protection. Some websites use specific methods to detect scrapers and can block us from analysing that website. This means that a human visiting a site may see something different from what our analysis indicates.
  • Server location. Websites sometimes infer the location of the visitor and adapt their interface, even within the same legal jurisdictions. Our servers are located in Denmark, and this might mean we see a different version of the website.

About us

This service is created and maintained by researchers at Aarhus University, Denmark.

If you want to talk about Consent Observatory, how it could be used for your research, or if you have any issues, please contact midasnouwens@cc.au.dk