House of the Dragon Season 2 word rareness & search trends

Motivation

I normally browse Reddit for discussion after watching a show, as one typically does. In the discussion thread for one of the House of the Dragon Season 2 episodes, a commenter joked about how adults were learning new words, like “comportment”, from the show. Sadly, I don’t remember the exact comment and couldn’t find it again, so I can’t credit it. Sorry, internet stranger!

Anyway, this prompted me to ask whether people actually search for the weird / rare words used in the show, at least in this season. In other words, do rarer words used in the show see higher search trends?

Hence, I set out to collect the scripts from season 2 using Springfield! Springfield!, cross-reference them with characters’ names from IMDb (via cinemagoer) to also construct a set of show-specific words, and then collect search trends using the DataforSEO API. The words’ usage metrics, including how rare they are (via wordfreq) and their relative usage in the show compared to their baseline frequencies, are then compared with the search trends.

I was able to find many rare words that were highly searched, possibly due to the show. In addition, it seems that the rarer a word is (i.e. the lower its baseline frequency) and the higher its relative usage in the show, the more people search for it, possibly to look up its meaning. And of course, show-specific words have much higher search trends overall.

Requirements

First, install the Python requirements with:

pip install -r requirements.txt

Additionally, make is required. However, this can be bypassed by manually copying the commands from the Makefile.

To collect trends data from the DataforSEO API, an account and API key are needed to fill in the .env file (as exemplified by .env.example).
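As a rough illustration of how the .env values might be turned into request credentials, here is a minimal sketch using only the standard library. It assumes the API is accessed with HTTP Basic authentication, and the variable names DATAFORSEO_LOGIN and DATAFORSEO_PASSWORD are hypothetical; check .env.example for the names this project actually expects.

```python
import base64


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def basic_auth_header(login, password):
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{login}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Hypothetical variable names and placeholder values (assumptions, not from the repo).
example = """
# DataforSEO credentials
DATAFORSEO_LOGIN=me@example.com
DATAFORSEO_PASSWORD="not-a-real-key"
"""
env = parse_env(example)
header = basic_auth_header(env["DATAFORSEO_LOGIN"], env["DATAFORSEO_PASSWORD"])
```

In practice the text would come from reading the .env file, and the header would be passed along with each trends request.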

Data collection

First, download the scripts of season 2 of House of the Dragon from Springfield! Springfield! using:

make download-scripts

The scripts are downloaded into data/house-of-the-dragon-2022_scripts.json.

Additionally, some NLP models / data are also needed:

make download-nlp-essentials

Then head over to the collect.ipynb notebook to process the scripts for word extraction, filtering, and trend data collection using the DataforSEO API.

This notebook requires an API key, which you get by creating an account with DataforSEO. Create .env from the example file .env.example and fill in the values it asks for.

Visualization

The data are processed and visualized in the visualize.ipynb notebook.

This combines the show’s word usage data (i.e. from the scripts) with search trends data from DataforSEO:

hotd-s2-words.csv: processed word usage from show S2 scripts.
data4seo-word-trends-[90d,12m].json: keyword trends data using DataforSEO API.

These data are processed and combined to output the following figures (in the figures folder):

hotds2-rare-word-trends.svg: Rare word trends (90d)
hotds2-stacked-word-trends-colored-by-base_freq_quartile.svg: Individual word trends (coarser), colored and sorted by word rareness
hotds2-stacked-word-trends-colored-by-log10_ratio_quartile.svg: Individual word trends (coarser), colored and sorted by their relative usage in the show scripts
hotds2-bulk-agg-word-trends.svg: Bulk trends across time
hotds2-after-air-trends-vs-usage.svg: Aggregate trends after show air and word’s usage metrics
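The file names above refer to a log10 ratio and to quartiles of the word metrics. The following is a minimal sketch of how such quantities could be computed, not the notebook's actual code. The frequencies are made up, the script_freq name is an assumption, and in the real pipeline the baseline frequency would come from wordfreq (for example via its word_frequency function).

```python
import math
from statistics import quantiles


def log10_ratio(script_freq, base_freq):
    """log10 of how much more often a word occurs in the scripts than at baseline."""
    return math.log10(script_freq / base_freq)


def quartile_labels(values):
    """Assign each value a quartile label from 1 (lowest) to 4 (highest)."""
    q1, q2, q3 = quantiles(values, n=4)
    return [1 + (v > q1) + (v > q2) + (v > q3) for v in values]


# Made-up frequencies (fraction of all words), for illustration only.
words = {
    "comportment": {"script_freq": 1e-3, "base_freq": 1e-5},
    "pliancy": {"script_freq": 1e-5, "base_freq": 1e-8},
    "father": {"script_freq": 1e-3, "base_freq": 1e-4},
    "the": {"script_freq": 5e-2, "base_freq": 5e-2},
}
ratios = {word: log10_ratio(**freqs) for word, freqs in words.items()}
ratio_quartiles = dict(zip(words, quartile_labels(list(ratios.values()))))
```

A log scale is convenient here because word frequencies span many orders of magnitude, so a ratio of 1000 simply becomes 3.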

Results

Figure 1: Trends of highly used rare (non show-specific) words that possibly increase because of the show.

First off, Figure 1 plots trends of rare words that are highly used and have high search trend peaks after the show airs. Each horizontal line is a compressed search trend of a word. Thin vertical lines show when each episode airs. Small black dots indicate when the word appears in an episode (a word can appear in many episodes). Vertical ticks on trend lines mark detected trend peaks. Gray ticks are peaks before the word appears in Season 2. The color of the other ticks indicates the episode in which the word first appears in Season 2.
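To make the idea of detected peaks concrete, here is a simplified sketch of peak detection on a single trend, split by whether a peak falls before or after the word's first appearance. This is an illustrative stand-in, not the method used in the notebook; the threshold, the function names, and the data are all assumptions.

```python
def find_peaks_simple(series, min_height):
    """Return indices of local maxima that reach at least min_height."""
    peaks = []
    for i in range(1, len(series) - 1):
        if series[i] >= min_height and series[i - 1] < series[i] >= series[i + 1]:
            peaks.append(i)
    return peaks


def split_peaks(peaks, first_appearance):
    """Separate peaks before the word's first appearance from those on or after it."""
    before = [i for i in peaks if i < first_appearance]
    after = [i for i in peaks if i >= first_appearance]
    return before, after


# Made-up search-interest values (0-100 scale), for illustration only.
interest = [5, 8, 40, 10, 6, 7, 90, 55, 20, 12]
peaks = find_peaks_simple(interest, min_height=30)
before, after = split_peaks(peaks, first_appearance=5)
```

In the figure's terms, the indices in before would be drawn as gray ticks and those in after as colored ticks.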

This figure illustrates that many rare words were searched for during the airing of season 2. Notice how searches for “pliancy” peaked after its only appearance in episode 3. Note that some words may have a high peak before their appearance in the show, but search interest may still increase noticeably after they appear, which is likely because of the show. For example, searches for “comportment” peaked after episode 7, while the peak before that, though still interesting, may have been due to other things in the world.

Next, to address whether rarer words are more likely to be searched, I also did another batch containing (1) show-specific words (as some can still be considered rare) and (2) more common words used in the show.

Figure 2: Different categories of words (colors) and their word-aggregate trends.

Figure 2 shows the 3 categories of words (colors), along with their bulk word-aggregate trends, computed via either the median (top) or the mean (bottom). The mean-aggregate peak for rare words (blue, bottom) may be driven by just 1-2 words that overwhelmingly influence the statistic, which is why the median aggregate is also included. Regardless, both panels strongly suggest higher search trends due to the show’s airing for show-specific words (red) and for rare words with high relative usage in the season (blue), but maybe not so much for more common words (gray). Interestingly, there seems to be a smaller but noticeable increase in search interest during April / May. It may be due to Season 1 rewatches before the Season 2 airing, or maybe just students binging the show before finals / graduation. This is expected for the show-specific words (red), but one would have to look back at the Season 1 scripts to see whether the non show-specific words (blue and gray) also appear in them.
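The point about one or two words dominating the mean can be seen in a tiny example. The numbers below are made up and the word names are placeholders; the sketch only shows why a median aggregate is more robust to a single spiking word.

```python
from statistics import mean, median

# Made-up search interest per word on three dates; word_c spikes on the second date.
trends = {
    "word_a": [10, 12, 11],
    "word_b": [9, 10, 10],
    "word_c": [11, 100, 12],
}

# Aggregate across words, producing one value per date.
per_date = list(zip(*trends.values()))
mean_trend = [mean(day) for day in per_date]
median_trend = [median(day) for day in per_date]
```

Here the mean on the second date is pulled up to roughly 40 by word_c alone, while the median stays at 12.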

Figure 3: After-airing time-aggregate trends vs word rareness/usage statistics.

Lastly, Figure 3 illustrates how word rareness (i.e. low baseline frequency) and relative usage (i.e. the ratio between script and baseline frequencies) strongly affect time-aggregated search trends after the show airs. Additionally, being show-specific boosts overall search interest even further, on top of the word usage/rareness statistics.
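One simple way to get a single after-airing number per word is to average its search interest from the air date onward. The sketch below assumes a mean as the aggregate, which may differ from the statistic used in the notebook; the dates and values are placeholders.

```python
from datetime import date
from statistics import mean


def after_air_mean(points, air_date):
    """Mean search interest over points dated on or after air_date (None if empty)."""
    values = [value for day, value in points if day >= air_date]
    return mean(values) if values else None


# Made-up (date, interest) points and a placeholder air date, for illustration only.
points = [
    (date(2024, 6, 1), 10),
    (date(2024, 6, 10), 12),
    (date(2024, 6, 20), 40),
    (date(2024, 6, 30), 30),
]
score = after_air_mean(points, air_date=date(2024, 6, 16))
```

Each word's score could then be plotted against its baseline frequency or log10 ratio to look for the kind of relationship described above.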