How We Used Government Job Posts to Understand Tech Trends
Last year, the federal government spent over $67 billion on IT. With that much money in motion, it can be difficult to understand precisely where the spending is going.
Traditionally, analysts turn to the information contained in publicly available contracts, but those contracting details are often intentionally vague. More importantly, they do not capture a longer-term technology investment a federal agency can make: hiring tech-savvy staff. The average tenure of a federal employee is nearly eight years, while most contracts run for less than a year. By comparison, committing to a technology-skilled hire is a big decision.
Against this backdrop, we decided to dig into federal jobs data to understand trends in technology hiring for one of the hottest topics of early 2023: artificial intelligence. That investigation and analysis became our first article, “Embracing Innovation: The Rise of AI / ML in the Federal Government.”
So how did we get from idea to insights?
Getting the Data
We started by identifying the federal job board, USAJOBS, as a window into federal technology investments. The Office of Personnel Management (OPM) maintains a backend application programming interface (API) that enables bulk download of historic federal job announcements. Upon inspecting the available data, however, we found that the documented endpoint was insufficient for our purposes: our analysis required the raw announcement text, not just metadata such as agency and General Schedule grade.
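For readers who want to poke at the same data, here is a minimal sketch of what querying the documented search endpoint looks like, based on the public USAJOBS developer documentation as we understand it; the API key and email are placeholders, and the fields shown reflect the documented response shape rather than our exact pipeline.

```python
import requests

# Query the documented USAJOBS search API (see developer.usajobs.gov).
# The email and API key below are placeholders you would register for.
HEADERS = {
    "Host": "data.usajobs.gov",
    "User-Agent": "your.email@example.com",       # registered email (placeholder)
    "Authorization-Key": "YOUR_USAJOBS_API_KEY",  # API key (placeholder)
}

response = requests.get(
    "https://data.usajobs.gov/api/search",
    headers=HEADERS,
    params={"Keyword": "machine learning", "ResultsPerPage": 25},
)
response.raise_for_status()

# Each result carries structured metadata (agency, title, pay) but not the
# full announcement text, which is why this endpoint alone fell short for us.
for item in response.json()["SearchResult"]["SearchResultItems"]:
    descriptor = item["MatchedObjectDescriptor"]
    print(descriptor["PositionTitle"], "|", descriptor["OrganizationName"])
```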
We then turned to the next tool: a Freedom of Information Act (FOIA) request. FOIA requests can be useful for dislodging data from older systems or obtaining better documentation. Through our submission, we learned that there was an undocumented endpoint for pulling historic announcement text. Undocumented endpoints are a common example of how a lack of clear, consistent documentation can make publicly available datasets harder to access. After several tries and a few help desk tickets, we had extracted all announcements from 2019 onward.
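The bulk pull itself was mostly patient pagination plus retries. The sketch below shows the general shape of that loop; because the endpoint we used is not publicly documented, the URL, parameter names, and response fields here are hypothetical placeholders rather than the real thing.

```python
import time
import requests

# Placeholder for the undocumented announcement-text endpoint surfaced by our
# FOIA request; the URL, parameters, and response fields are hypothetical.
ANNOUNCEMENT_TEXT_URL = "https://data.usajobs.gov/api/<historic-announcement-endpoint>"
HEADERS = {"Authorization-Key": "YOUR_USAJOBS_API_KEY"}  # placeholder credentials


def fetch_page(page: int, retries: int = 3) -> dict:
    """Fetch one page of historic announcements, retrying on transient errors."""
    for attempt in range(retries):
        resp = requests.get(
            ANNOUNCEMENT_TEXT_URL,
            headers=HEADERS,
            params={"StartDate": "2019-01-01", "Page": page},
        )
        if resp.ok:
            return resp.json()
        time.sleep(2 ** attempt)  # simple exponential backoff between retries
    resp.raise_for_status()


# Walk pages until the endpoint stops returning results, accumulating records.
announcements, page = [], 1
while True:
    batch = fetch_page(page)
    items = batch.get("items", [])  # hypothetical response field
    if not items:
        break
    announcements.extend(items)
    page += 1
```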
Analyzing the Data
Since job postings do not contain technology tags by default, we needed a way to extract relevant terms. We initially cast a wide net, looking for a solution that would scale to other technology terms should we want them later. In natural language processing, this task is called named entity recognition (NER). While there are countless ways to approach NER, a few tools exist specifically for analyzing job postings. We iterated through several of them, including SkillNER, an open-source skill extraction tool, and Lightcast, a closed-source API built on job posting data.
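To make the task concrete, here is a minimal NER example using spaCy's off-the-shelf English model; it illustrates the general technique, not the SkillNER or Lightcast pipelines we actually evaluated.

```python
import spacy

# Generic named entity recognition with spaCy's small English model.
# Install the model first with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

posting = (
    "The CIA seeks a data scientist with experience in machine learning, "
    "natural language processing, and people management."
)

doc = nlp(posting)
for ent in doc.ents:
    # An off-the-shelf model tags "CIA" as an organization, but it has no
    # notion of skills; skill extractors layer their own vocabularies on top.
    print(ent.text, ent.label_)
```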
After several rounds of testing, we concluded that these NER models were insufficient for our purposes. They produced too many false positives and junk items unrelated to our analysis. A few examples:
- “CIA” was classified as a ‘central internal auditor’ skill requirement rather than as the Central Intelligence Agency.
- An overly generic “management” skill lumped together everything from “people management” to “data management.”
- An under-specialized vocabulary skipped over terms like “NER” entirely.
We pivoted to something much simpler: keyword search. We developed a vocabulary of artificial intelligence-related terms based on samples of USAJOBS postings and searched every announcement against this keyword bank. From there, we merged the matches with the job metadata (posting date, agency, and more) to create a dataset of AI/ML-related jobs in the federal jobs market. We now had a working dataset.
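A stripped-down sketch of that keyword approach is below; the handful of terms shown is only an illustrative slice of our vocabulary, and the column names are hypothetical stand-ins for our actual schema.

```python
import re
import pandas as pd

# A small, illustrative slice of an AI/ML keyword bank; the column names
# below ("announcement_id", "text", "agency", ...) are hypothetical.
AI_TERMS = [
    "artificial intelligence", "machine learning", "deep learning",
    "natural language processing", "neural network", "computer vision",
]
pattern = re.compile("|".join(re.escape(t) for t in AI_TERMS), re.IGNORECASE)

announcements = pd.DataFrame({
    "announcement_id": [1, 2],
    "text": [
        "Seeking a data scientist with machine learning experience.",
        "Administrative assistant supporting front-office operations.",
    ],
})
metadata = pd.DataFrame({
    "announcement_id": [1, 2],
    "agency": ["Department of Energy", "Department of the Interior"],
    "posting_date": ["2023-01-15", "2023-02-01"],
})

# Flag announcements that mention any AI/ML term, then join in the metadata
# (posting date, agency, and more) to build the working dataset.
announcements["is_ai_ml"] = announcements["text"].str.contains(pattern)
ai_jobs = announcements[announcements["is_ai_ml"]].merge(metadata, on="announcement_id")
print(ai_jobs)
```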
What’s Next?
We believe in the potential of open government data to provide insight into government administration and society at large. In particular, this analysis revealed the need for deeper, industry-specific NER models tuned to technology vocabulary, which would allow us to track how technology is evolving. Petabytes of contracts, job postings, and other text data sit untapped as a source of useful market insights.
Beyond mentions of the technologies themselves, there is additional value in creating a metadata layer that surfaces broader trends. Is the U.S. government embracing open source? Are fast-rising technologies built by U.S. organizations or by foreign companies? Which domains are seeing the most growth in modern technologies?
We look forward to exploring these questions in the coming months as we seek to translate our analysis into novel tools.