
$1.2 million NSF grant to create search engine for online privacy research

A grant will help researchers build tools to examine privacy documentation, which could lead to a safer internet for users. Credit: Dan Nelson/Unsplash. All Rights Reserved.

UNIVERSITY PARK, Pa. — A team of Penn State-led researchers recently received a $1.2 million National Science Foundation (NSF) grant to build a search engine and other resources that can make the web safer for users by helping scientists scour billions of online documents to more efficiently collect and classify privacy documentation.

The search engine, called PrivaSeer, will use a type of artificial intelligence (AI) called natural language processing, or NLP, to help researchers collect, review and analyze privacy documents, including privacy policies, terms of service agreements, cookie policies, privacy bills and laws, regulatory guidelines and other related texts on the web.

NLP combines linguistics, computer science and AI to program computers that can better process and analyze large amounts of natural language data.

Ultimately, the search engine could help researchers better understand online privacy and online privacy trends, while also helping users navigate the web more safely and securely, according to Shomir Wilson, assistant professor of information sciences and technology at Penn State and an Institute for Computational and Data Sciences (ICDS) affiliate.

“Privacy policies are documents that we encounter in our day-to-day lives when we visit websites and, in theory, we’re supposed to read them,” said Wilson. “But, in practice, few people do that. It’s not practical and it doesn’t fit into how people use the internet. People often don’t have the legal knowledge to understand these documents, either.”

Wilson, who serves as the lead principal investigator (PI) for the project, said that the search engine is needed because, even though numerous documents about organizations’ privacy and data practices are available on the web, researchers face the daunting challenge of identifying and gathering these documents. According to the researchers, collecting this information currently requires painstaking manual searches.

“There’s been some prior work on privacy policies, but one thing researchers have run into is that there is a lack of good data on those policies,” said Wilson.

The search engine can also offer insights into how policies change and help users navigate the complex field of online privacy, according to C. Lee Giles, the David Reese Professor of Information Sciences and Technology, Penn State, and a project co-PI.

“One of the reasons to have a privacy policy search engine is that you can get an idea about how different companies treat their user privacy currently and over time,” said Giles, who is also an ICDS associate. “This can also inform users how they may want to react to those companies.”

The researchers said that PrivaSeer will also advance NLP techniques for large-scale interpretation of such privacy documents. This technology will help scientists analyze the state of privacy at an unprecedented scale.

Creating the search engine poses several challenges for the team, according to Giles.

“One of the challenges of building a privacy policy search engine is crawling the web for those pages,” said Giles. “There is no list of URLs for this. Does one try a URL — for example, 'https://company.com/privacy.html' — or something different? Once the page is returned, how do we know it is a privacy page?”
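The two questions Giles raises, which URL to try and how to tell whether a returned page is a privacy page, can be illustrated with a minimal Python sketch. This is not PrivaSeer's method: the path guesses, the cue phrases and the threshold below are hypothetical placeholders chosen only for illustration, and a real system would likely rely on a trained classifier rather than keyword counting.

```python
from urllib.parse import urljoin

# Hypothetical path guesses (assumption); real sites use many other locations.
CANDIDATE_PATHS = ["/privacy", "/privacy.html", "/privacy-policy", "/legal/privacy"]

# Hypothetical cue phrases (assumption) that often appear in privacy policies.
CUE_PHRASES = [
    "privacy policy",
    "personal information",
    "personal data",
    "cookies",
    "third parties",
    "we collect",
]


def candidate_urls(base_url):
    """Build a list of URLs to try for a site, one per guessed path."""
    return [urljoin(base_url, path) for path in CANDIDATE_PATHS]


def looks_like_privacy_page(text, min_hits=3):
    """Crude heuristic: count how many distinct cue phrases occur in the text."""
    lowered = text.lower()
    hits = sum(1 for phrase in CUE_PHRASES if phrase in lowered)
    return hits >= min_hits


if __name__ == "__main__":
    for url in candidate_urls("https://company.com"):
        print(url)
    sample = "Our Privacy Policy explains what personal information we collect."
    print(looks_like_privacy_page(sample))
```

A keyword heuristic like this would misfire on pages that merely mention privacy and miss policies written in unusual wording, which is one reason the classification step is described as a challenge.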

In addition to the search engine, the team also plans to develop corpora — large datasets of text — and application programming interfaces, or APIs.

Other PIs on the project include Florian Schaub, assistant professor of information, electrical engineering and computer science, University of Michigan, and Gabriela Zanfir-Fortuna, director for global privacy at the Future of Privacy Forum.

Last Updated June 28, 2021