While the original dataset stems from Quotebank and comprehensively spans the English-speaking media world from 2015 to early 2020, for our analyses we decided to focus on quotes originating from US media. Furthermore, not all quotes have an identified speaker, so we kept only those whose speaker has a Wikidata entry, which we used to extract their gender.
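A minimal sketch of this filtering step, assuming the quotes sit in a pandas DataFrame with Quotebank's `qids` column of speaker candidates and that speaker genders were pre-extracted from a Wikidata dump into a lookup table; the file names and column layout are illustrative assumptions, not the exact ones we used.

```python
import pandas as pd

# Hypothetical file names and column layout -- adjust to the actual Quotebank dumps.
quotes = pd.read_json("quotes-us-media.json.bz2", lines=True, compression="bz2")
speakers = pd.read_parquet("speaker_attributes.parquet")  # one row per Wikidata QID

# Keep only quotes whose speaker was resolved to a Wikidata entity.
quotes = quotes[quotes["qids"].str.len() > 0].copy()
quotes["speaker_qid"] = quotes["qids"].str[0]  # take the top-ranked speaker candidate

# Attach the speaker's gender via the Wikidata QID.
gender_lookup = speakers.set_index("id")["gender"]
quotes["speaker_gender"] = quotes["speaker_qid"].map(gender_lookup)
quotes = quotes.dropna(subset=["speaker_gender"])
```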
But on its own this left us with a rather boring dataset, not one suited for an ADAventure into the US media landscape. Beyond the Wikidata links to the identified speakers, which were already part of Quotebank, there was not much to explore. Thus, we expanded it by linking it to the online resource Media Bias Fact Check, which provided exactly what we needed: a comprehensive database covering the majority of English-speaking media (and an even larger share of American media), in which every entry is classified by country of origin, factual reporting, political bias (left-leaning, right-leaning) and the traffic of the respective media webpage. The methodologies for assigning these attributes can be found here. We decided to perform our analyses on US media with high traffic, as our goal was to investigate the mainstream female representation that is consumed most broadly.
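As a rough illustration of the linking step, here is a sketch that joins the quotes to a Media Bias Fact Check table by media domain. The CSV layout, the column names, and the use of the first article URL to identify the outlet are assumptions for the example, not a description of MBFC's actual export format.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical MBFC export -- columns: domain, country, bias, factual_reporting, traffic.
mbfc = pd.read_csv("mbfc_media.csv")

def extract_domain(url: str) -> str:
    """Reduce an article URL to its bare domain, e.g. 'https://www.nytimes.com/...' -> 'nytimes.com'."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

# Quotebank stores the article URLs in a list column; use the first one to identify the outlet.
quotes["domain"] = quotes["urls"].str[0].apply(extract_domain)

# Join the MBFC attributes and keep high-traffic US media only.
quotes = quotes.merge(mbfc, on="domain", how="inner")
quotes = quotes[(quotes["country"] == "USA") & (quotes["traffic"] == "High")]
```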
There are more than a thousand different media sources across the USA. As the ratio of sources for each political bias is similar whether we look at all media sources or only at the high-traffic ones, we will use only the latter, which also allows us to include editors-in-chief in the dataset.
Therefore the skewed representation of left-leaning and right-leaning media is not due to us selecting an unrepresentative tip of the iceberg but is intrinsic to the US media landscape. The same trend is observed for the number of quotes in the database, so with the setup we chose, we have to retain this bias. However, since we still deal with relatively large numbers (1.5 million quotes) and are interested in overall trends, we have enough data in both subcategories to perform the topic analyses we plan.
We are also interested in female representation behind the scenes. According to studies, the situation in the US media landscape is dire (find an executive summary on female chief editors), so we set out to see for ourselves and annotated the 147 highest-traffic online media by hand with their current editor-in-chief and their respective gender. Ideally, we would have liked to know the genders of the journalists responsible for the individual articles, which would give even better insight into whether female reporting increases female voice representation, but this data was intractable to obtain, so we had to resort to the next higher level that likely influences the content that is created: the editors-in-chief.
It should be mentioned that in an ideal world we would have worked at an even finer grain, classifying each medium by its topical focus and then only comparing media with the same focus (left-leaning lifestyle magazines with right-leaning ones, etc.), since not accounting for such a balance introduces a slight bias when analysing recurring topics. However, researching 150 editors by hand was already quite resource-intensive, and to have enough media for each category we would have had to go far beyond that number. So we decided to stay with this coarse-grained distinction between media based simply on their political affiliation, keeping this potential confounder in mind for the topic assignments.
Which brings us to the last step in constructing our dataset: to obtain a final dataset amenable to topic analysis, we subsampled the dataset formed above with uniform random sampling. To represent all years equally, we sampled the same number of quotes from each year, 250 000 data points, rather than a percentage-based fraction per year. For the topic analyses on the quotes we ran the Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) algorithm, the computationally less efficient little sister of LDA, which has however shown better results for short-text topic analysis. As we want to extract the topic of roughly single sentences, the assumptions of LDA are not met and we thus had to use GSDMM. Background info on GSDMM.
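A minimal sketch of these two steps, using pandas for the per-year sampling and the `gsdmm` package (the MovieGroupProcess implementation) for the clustering; the per-year sample size, the tokenisation, and the hyperparameters K, alpha, beta and n_iters are illustrative placeholders, not the exact values we used.

```python
import pandas as pd
from gsdmm import MovieGroupProcess  # https://github.com/rwalk/gsdmm

# --- Equal representation of the years (sample sizes are illustrative) ---
quotes["year"] = pd.to_datetime(quotes["date"]).dt.year
per_year = 250_000 // quotes["year"].nunique()
sampled = (
    quotes.groupby("year", group_keys=False)
    .apply(lambda g: g.sample(n=min(per_year, len(g)), random_state=42))
)

# --- Short-text topic modelling with GSDMM ---
# Tokenise the quotes; a real pipeline would also remove stop words, punctuation, etc.
docs = [quote.lower().split() for quote in sampled["quotation"]]
vocab = {token for doc in docs for token in doc}

# K is an upper bound on the number of topics; GSDMM prunes empty clusters itself.
mgp = MovieGroupProcess(K=20, alpha=0.1, beta=0.1, n_iters=30)
labels = mgp.fit(docs, len(vocab))
sampled["topic"] = labels
```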