Pubcrawler – The Autonomy Data Unit Blog

Check out the code on GitHub

The UK’s policy research landscape is made up of hundreds of organisations distinguished by their varying levels of research output, transparency and historical roots. Whilst Autonomy celebrated their 6th birthday earlier this year, several think tanks still operating in the UK trace their origins back to the 19th century. Faced with an ever growing number of research organisations generating a steady stream of publication, it is likely that our knowledge of many of these organisation is incomplete or out of date. Beyond the challenge of digesting an insurmountable volume of research, there is the more immediate difficulty in getting hold of all of the research an organisation has published. Searching for PDFs throughout the cavernous subpages of an organisation’s website is time consuming work. Furthermore if we would like to get metadata from publications like the authors, dates and subject matter, we are likely to encounter an infinite range of text formats.

To address some of these challenges we’ve developed Pubcrawler a simple flexible tool that applies webscraping to pull all the publications from an organisation’s website and then categorise them with LLMs. In this post we will demonstrate how the tool can automate the gathering of intel in a few simple steps, starting with an organisation’s url as input.

1. Crawl

We begin by providing ‘Pubcrawler’ with the url of the organisation we want to investigate. For testing purposes we decided to embark on the meta project of investigating our own organisation and so https://autonomy.work/ is the input. Using Selenium the tool will recursively crawl through every subpage that eventually links back to the Autonomy landing page, whilst saving the location of each PDF publication it spots along the way:

Crawling through the nearly 1000 subpages on the Autonomy site completed in less than an hour:

2. Publications

After navigating through the entire site, 190 PDF files were found. Before opening the files we assume that most of these PDFs were written by Autonomy whilst some will come from other organisations. The next stage requires downloading as many of the found files as possible using Selenium. Due to the presence of some links that no longer exist, only 170 publications could be successfully downloaded.

3. Metadata

With all the publications in one place we can begin the task of extracting useful metadata that will allow various characteristics of the organisation to be visualised. It is worth noting that PDF files often contain some metadata although it is often sparse and low quality. Here’s an example of some of the metadata we can grab from the Autonomy publications with PyPDF2:

	CreationDate	Creator	ModDate	Producer	Trapped	file_name	Title	Author	Keywords	Company	...	GTS_PDFXConformance	GTS_PDFXVersion	PTEX.Fullbanner	XPressPrivate	WPS-ARTICLEDOI	WPS-JOURNALDOI	WPS-PROCLEVEL	Comments	Appligent	SPDF
0	D:20230710141411+01'00'	Adobe InDesign 18.2 (Macintosh)	D:20230710141419+01'00'	Adobe PDF Library 17.0	/False	Treating-causes-not-symptoms-Jul-23.pdf	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	D:20200825145942+01'00'	Microsoft® Word for Office 365	D:20200825145942+01'00'	Microsoft® Word for Office 365	NaN	Public-Sector-as-Pioneer-2.pdf	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	D:20230608103737+02'00'	Adobe InDesign 18.3 (Macintosh)	D:20230608103741+02'00'	Adobe PDF Library 17.0	/False	BASINCSHORT.pdf	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	D:20181215103417Z00'00'	NaN	D:20181215103417Z00'00'	GPL Ghostscript 8.70	NaN	VERSION FOR ARCHIVING.pdf	Mental healthcare staff well‐being and burnout...	Johnson, J, Hall, LH, Berzins, K, Baker, J, Me...	burnout; health services; mental health; patie...	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	D:20190909192632+01'00'	Acrobat Pro DC 19.12.20040	D:20190909192632+01'00'	Acrobat Pro DC 19.12.20040	NaN	PEF_Skidelsky_How_to_achieve_shorter_working_h...	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 26 columns

To extract higher quality metadata from the actual content within a PDF we require an approach that is flexible towards the vast differences in presentation format and structure across documents.

Thankfully language models like GPT-3.5 and GPT-4 demonstrate strong capabilities in entity recognition across a broad range of text formats. It is even possible to pass an entire PDF to GPT-4 within a single API call due to recent extensions in the model’s context window, however this becomes costly when processing multiple documents (potentially several $ per document), incentivising us to minimise the number of tokens transmitted via API.

To demonstrate the kinds of metadata that could be extracted reliably from PDF text, we attempt to get some basic bibliographic information on each document, including authors, dates and institutional affiliations. As a general rule, much of this information is likely to exist in the first few pages of each document, therefore passing the first 5 pages of each PDF to GPT-3.5/GPT-4 should allow us to save costs. Using the following function call, we instruct the model to return the desired metadata in a structured format:

{'name': 'generate_citation',
 'description': 'Generate a citation for report',
 'parameters': {'type': 'object',
  'properties': {'title': {'type': 'string',
    'description': 'Report title. If unknown leave empty'},
   'authors': {'type': 'array',
    'description': "Array of the author's full names. If unknown leave empty",
    'items': {'description': "Author's full name", 'type': 'string'}},
   'organisation': {'type': 'array',
    'description': 'Array containing the names of the research organisations that produced the report. If unknown leave empty',
    'items': {'description': "Research organisation's name",
     'type': 'string'}},
   'date': {'type': 'string',
    'description': 'Date the report was published on. Try to find the day, month and year. If unknown leave empty'},
   'keywords': {'type': 'array',
    'description': 'Array of keywords that indicate the content of the report. If unknown leave empty',
    'items': {'description': "Keyword title ie 'feminism'", 'type': 'string'}},
   'funders': {'type': 'array',
    'description': 'Array of organisations that provided funding or financial support. If unknown leave empty',
    'items': {'description': 'Name of funding organisation',
     'type': 'string'}}},
  'required': ['title',
   'authors',
   'organisation',
   'date',
   'keywords',
   'funders']}}

By passing the first 5 pages from the following report to GPT-4 with the above function call, we get the proceeding structured output in return:

{'title': 'The Shorter Working Week: A Radical And Pragmatic Proposal',
 'authors': ['Will Stronge',
  'Aidan Harper',
  'Danielle Guizzo',
  'Kyle Lewis',
  'Madeleine Ellis-Petersen',
  'Nic Murray',
  'Helen Hester',
  'Matt Cole'],
 'organisation': ['Autonomy', 'Autonomy Research Ltd', '4 Day Week Campaign'],
 'date': '2019',
 'keywords': ['Shorter Working Week',
  'Current Model of Work',
  'Future Model of Work-Time'],
 'funders': []}

When applied across all of Autonomy’s documents we observe that both GPT-3.5 and GPT-4 are able to extract entities with minimal errors in spelling/formatting, lifting most entities correctly from the test. The main sources of inaccuracy stem from false positive and false negative classifications of persons and organisations as authors or funders. GPT-3.5 seemed to be much less capable in recognising funders than GPT-4. However whilst GPT-4 demonstrates superior classification quality, GPT-4 cost $9.79 to classify 850 pages of text whilst GPT-3.5 cost $0.41.

4. Visualisation

With GPT-4 we have generated structured metadata for each downloaded PDF that can be plugged into data visualisations. GPT-4 was prompted to return a list of authors for each publication and so we can easily visualise the most common occurances. Its unsurprising to find Autonomy’s founders and researchers make up the top 6 most cited authors:

The same approach can be followed to visualise Autonomy’s most frequent funders and institutional collaborators (usually in the form of co-authors). GPT-4 has been broadly successful in differentiating between these two fuzzy classes of organisation. This metadata could be prove useful in connecting research organisations together and sketching how wider political projects are funded:

The presence of keywords within text can indicate which policy areas are of greatest interest to an organisation. Whilst the following results are quite general and unsurprising in Autonomy’s case given that the organisation is focussed on the future of work, it can be useful to compare the most common keywords with the stated aims of an organisation for misalignment:

Temporal data can help us chart the peaks and troughs in research output over time:

Where the month of publication is mentioned we can generate a more granular map:

It is also possible to explore how collaborative an organisation is through investigating instances of co-authorship. The following plot shows which authors found in Autonomy’s reports have collaborated together, represented by an edge connecting them. Everytime two authors are found in the same citation we count this as a single collaboration. Edge weight represents the number of times the authors have collaborated:

5. Conclusion

In this brief post we’ve demonstrated the automated search and retrieval of an organisation’s publication archive from ~1000 webpages in less than hour. LLMs can process these large collections of documents producing rich metadata to power useful visualisations. We continue to find the possibilities of combining web scraping and LLMs exciting as they allow us to build highly versatile information retrieval systems that easily overcome the nuances of heterogeneous web design and file formats. We look forward to testing ‘Pubcrawling’ on a much wider set of organisations and expanding the kinds of metadata we retrieve.