How We Used ChatGPT to Analyze Congressional Finances
Each year, members of Congress submit financial disclosure forms that detail their assets and debts. These forms allow the American public to assess whether elected officials have placed their country's interest before their own.
But neither the House nor the Senate allows you to analyze data for multiple elected officials at once. Both systems for accessing the data require you to look up information by individual. The Senate makes this data available as web tables, and the House makes it available as PDFs.
Unfortunately, such siloed, inconsistently formatted data is common among government documents. To unlock its potential, we developed an AI-based application to extract the contents of House members' PDF disclosure forms.
How We Did It
Our first attempt to extract House members' financial data using pattern matching failed miserably. Minor formatting differences between forms repeatedly broke a complex, fragile Python script (as well as our developer's spirit).
What we needed was intuition. In the face of formatting inconsistencies, we needed a technology that could understand that, no, the value of an asset is not "location: North Dakota." New AI tools known as large language models (LLMs) offered just that. Working in Python, we built an application that sends text from PDFs to ChatGPT and returns well-structured financial data, ready for analysis.
Here's how we did it:
- Gather Data: The House Committee on Ethics releases an annual dataset containing metadata on all financial disclosure submissions. We matched the URL of each House member's financial disclosure form with a broader congressional dataset to get the information we needed to carry out the project.
- Chunk and Clean: ChatGPT limits the amount of text you can send at once. Before sending data to its API endpoint, we first needed to break the text of financial disclosure forms into meaningful chunks. We did this by breaking up the text by page number and extracting only relevant financial information.
- OpenAI API: OpenAI offers an easy way to send API requests to ChatGPT using Python. Following this guide, we asked ChatGPT to return financial data in a structured table format, where each row represented information about a financial item.
- Concurrent Processing: Earlier versions of our application were too slow. ChatGPT can take several minutes to process more complex PDF pages, and waiting for one page to finish before starting the next meant the program would take days to complete. To speed things up, we implemented parallel processing, allowing us to send multiple API requests simultaneously while staying under ChatGPT's request rate limits.
- Clean Again: Like people, ChatGPT makes mistakes. Examining the aggregated data, we identified rows where the AI inserted incorrect values. Since these errors constituted a very small percentage of the overall data, we removed them.
- Add Manually Processed Data: Not all members of the House submit their financial disclosure forms electronically. Some submit manual forms, which are made available as grainy image PDFs. Some of these we processed by hand, while some were too long to process at all. As a last step, we joined the manually processed data with the AI-processed data.
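To make the "Chunk and Clean" step concrete, here is a minimal sketch of how page-level chunking and prompt construction might look. The function names, keyword list, and column names are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch of the chunk-and-clean step: split extracted PDF text
# by page, keep only pages with financial content, and build a prompt that
# asks for a structured table. All names here are hypothetical.

FINANCIAL_KEYWORDS = ("asset", "liability", "income", "value of")

def chunk_by_page(raw_text: str, page_marker: str = "\f") -> list[str]:
    """Split raw PDF text into per-page chunks (form feed marks a page break)."""
    return [page.strip() for page in raw_text.split(page_marker) if page.strip()]

def keep_financial_pages(pages: list[str]) -> list[str]:
    """Drop pages that contain no recognizable financial information."""
    return [p for p in pages if any(k in p.lower() for k in FINANCIAL_KEYWORDS)]

def build_prompt(page_text: str) -> str:
    """Ask the model to return one structured row per financial item."""
    return (
        "Extract each financial item from the disclosure text below as a row "
        "with columns: owner, asset, value_range, income_type.\n\n" + page_text
    )
```

Chunking by page keeps each request comfortably under the API's token limit, and filtering out non-financial pages avoids paying to process boilerplate.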
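The concurrent-processing step can be sketched with Python's standard `concurrent.futures` module. Here `call_model` is a stand-in for the real ChatGPT API call, and `max_workers` is an assumed cap on in-flight requests to stay under the rate limits; this is a sketch, not our production code:

```python
# Sketch of parallel page processing. A thread pool overlaps the slow,
# I/O-bound API calls; capping max_workers keeps us under rate limits.
# `call_model` would wrap the real OpenAI API request in practice.

from concurrent.futures import ThreadPoolExecutor

def process_pages(pages, call_model, max_workers=5):
    """Send many page-sized chunks to the model in parallel, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order even though calls overlap
        return list(pool.map(call_model, pages))
```

Because the work is network-bound rather than CPU-bound, threads are enough here; results come back in the original page order, which keeps downstream joins simple.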
Room for Improvement
Our most significant problem was page breaks. Sometimes, information about a single asset or liability was broken up over two pages. Since ChatGPT did not look at two pages at once, it would get confused when this happened. Proactively identifying such page breaks and cleaning the text around them helped ChatGPT interpret the correct values, but we are still searching for a more robust solution.
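One way to pre-clean text around page breaks is a heuristic like the following: if a page's last line looks like a truncated row (for example, it names an asset but carries no dollar value range), move it onto the next page before either chunk is sent to the model. The detection rule and names below are illustrative assumptions, not our production logic:

```python
# Heuristic sketch for rows split across page breaks: a last line with no
# "$X - $Y" value range probably belongs to a row continued on the next page,
# so we glue it onto that page's text. The rule is illustrative only.

import re

VALUE_RANGE = re.compile(r"\$[\d,]+\s*-\s*\$[\d,]+")

def merge_split_rows(pages: list[str]) -> list[str]:
    fixed = list(pages)  # work on a copy
    for i in range(len(fixed) - 1):
        head, _, last = fixed[i].rpartition("\n")
        if head and not VALUE_RANGE.search(last):
            # Last line lacks a value range: likely cut off by the page break.
            fixed[i] = head
            fixed[i + 1] = last + " " + fixed[i + 1]
    return fixed
```

A rule this simple will misfire on pages that legitimately end without a value range, which is one reason we are still looking for a more robust solution.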
Looking Ahead
What excites us most about this application is its flexibility. The data extraction method for this project is easily applicable to other sources of unstructured data, like policy documents, images, and audio. A simple change to the ChatGPT prompt lets us ask new questions of a wide array of data.