Visualization: Are more movies set in the future or the past?

Visualization: Are more movies set in the future or the past?#

Motivation#

I was curious about the balance between media set in the future vs movies set in the past. Luckily, people on Wikipedia have tagged movies both for when they are produced/released, as well as whether they are set in certain years, decades or centuries.

For example, Category:Films_set_in_1945 contains list of movies set in 1945, while Category:2022_films contains list of movies from 2022.

Obtain data#

The data are scraped from Wikipedia using category pages.

The produced years are from Category:Films_by_year.

The set-in years are from 3 categories:

Category:Films_by_century_of_setting
Category:Films_by_decade_of_setting
Category:Films_by_year_of_setting

See the scrape.ipynb notebook for more information.

Import modules#

Load data#

data_path = 'data/films_set_in_and_produced.csv'
df = pd.read_csv(data_path)
df

	pageid	title	source_year_setin	year_produced	year_setin	year_setin_type
0	3217	Army of Darkness	13th	1992	1200	century
1	3333	The Birth of a Nation	1860s	1915	1860	decade
2	3333	The Birth of a Nation	19th	1915	1800	century
3	3333	The Birth of a Nation	1870s	1915	1870	decade
4	3746	Blade Runner	2019	1982	2019	year
...	...	...	...	...	...	...
20150	74597836	The Sacrifice Game	1971	2023	1971	year
20151	74607682	Joachim and the Apocalypse	12th	2023	1100	century
20152	74612478	The Kitchen (2023 film)	2040s	2023	2040	decade
20153	74619750	Jailer (2023 Malayalam film)	1950s	2023	1950	decade
20154	74621258	Robot Dreams (film)	1980s	2023	1980	decade

20155 rows × 6 columns

Brief notes on data#

It is intuitive that a movie can be set in multiple years.

But, a movie entry (by pageid) can be appear in multiple Wikipedia categories ABCD films. Technically they’re not always produced in such years.

As I understand there may be different scenarios as to how this happens, for example release dates or an entry is a TV series.

df.query('pageid == 67231914')

	pageid	title	source_year_setin	year_produced	year_setin	year_setin_type
19286	67231914	Edge of the World (2021 film)	19th	2021	1800	century
19287	67231914	Edge of the World (2021 film)	19th	2023	1800	century

For example, Edge of the World (2021 film) belongs to both Category: 2021 films and Category: 2023 films. The Wikipedia page says

Edge of the World was released in the US […] on 4 June 2021, […] then in the UK […] on 21 June 2021.

In Malaysia and Brunei, the film has been slated as Rajah in the countries and scheduled on 9 March 2023 in cinemas.

Similarly, the Wikipedia box says

Release dates

4 June 2021 (US)

9 March 2023 (Malaysia and Brunei)

So this is due to different release dates.

df.query('pageid == 6819660')['year_produced'].unique()

array([1967, 1968, 1969, 1970, 1971, 1973])

Another example is Adventures of Mowgli, with 6 different years for production/release. This is not entirely like the above example with similar content and different release dates, but actually 5 different contents (1967-1971) and then later combined into one content released in 1973.

Adventures of Mowgli (Russian: Маугли; also spelled Maugli) is an animated feature-length story originally released as five animated shorts of about 20 minutes each between 1967 and 1971 in the Soviet Union. […]

In 1973, the five films were combined into a single 96-minute feature film.

Hence, take note that year_produced should be taken with a jitter of a few years in mind in some cases. Plus, it is not correct that is is produced, it may be a combination of production and release, maybe even (re)distribution.

For simplicity, I will not consider movie titles and ignore this for the time being, and instead analyze all pairs of (year_produced, year_setin).

Get time difference#

This is to calculate the time difference between year_setin and year_produced, with consideration of units.

If the time difference is positive, the movie is set in the future of the time that it is produced/released, e.g. post-apocalyptic movies.

If it is negative, the movie is set in the past, e.g. war movies.

df['year_diff'] = df['year_setin'] - df['year_produced']
df['diff_unit'] = df['year_setin_type']
df['num_years'] = df['diff_unit'].map({'century': 100, 'decade': 10, 'year': 1})
df['time_diff'] = df['year_diff'] / df['num_years']

df

	pageid	title	source_year_setin	year_produced	year_setin	year_setin_type	year_diff	diff_unit	num_years	time_diff
0	3217	Army of Darkness	13th	1992	1200	century	-792	century	100	-7.92
1	3333	The Birth of a Nation	1860s	1915	1860	decade	-55	decade	10	-5.50
2	3333	The Birth of a Nation	19th	1915	1800	century	-115	century	100	-1.15
3	3333	The Birth of a Nation	1870s	1915	1870	decade	-45	decade	10	-4.50
4	3746	Blade Runner	2019	1982	2019	year	37	year	1	37.00
...	...	...	...	...	...	...	...	...	...	...
20150	74597836	The Sacrifice Game	1971	2023	1971	year	-52	year	1	-52.00
20151	74607682	Joachim and the Apocalypse	12th	2023	1100	century	-923	century	100	-9.23
20152	74612478	The Kitchen (2023 film)	2040s	2023	2040	decade	17	decade	10	1.70
20153	74619750	Jailer (2023 Malayalam film)	1950s	2023	1950	decade	-73	decade	10	-7.30
20154	74621258	Robot Dreams (film)	1980s	2023	1980	decade	-43	decade	10	-4.30

20155 rows × 10 columns

Visualization#

Plot distribution of years set-in#

Show code cell source Hide code cell source

plt.figure(figsize=(18,5))

count_offset = 15

select_periods = {
    'US Civil War': [1861, 1865],
    'World War I': [1914, 1918],
    'World War II': [1939, 1945],
}

select_years = {
    1969: 'Apollo 11\nMoon Landing',
    1999: 'before Y2K',
    2012: 'defunct apocalypse',
    2023: 'present year'
}

select_year_list = list(select_years.keys())

df_year_target = df.query('year_setin_type == "year"')

select_counts = (
    df_year_target
    .query('year_setin in @select_year_list')
    .value_counts('year_setin')
    .to_frame('count')
    .join(
        pd.Series(select_years).to_frame('desc'), 
        how='outer'
    )
    .reset_index()
    .rename(columns={'index': 'year_setin'})
)

select_counts['count_offset'] = select_counts['count'] + count_offset

    
sns.histplot(
    df_year_target.query('year_setin > 1800 and year_setin < 2050'),
    x='year_setin',
    discrete=True,
    element='step',
    color='k',
    edgecolor='none',
    alpha=0.1
)


sns.scatterplot(
    select_counts,
    x='year_setin',
    y='count_offset',
    c='#12a4b6',
    s=25,
    marker='o',
    edgecolor='none'
)

for _, r in select_counts.iterrows():
    t_ha = 'left'
    t_va = 'center'
    t_yr = r['year_setin']
    c_os = r['count_offset']
    desc = r['desc']
    if pd.isna(desc):
        continue
    desc = f"{t_yr} ({desc})"
    
    x_text = t_yr + 2
    y_text = c_os
    
    if t_yr == 1969:
        # to avoid blocking
        x_text = t_yr
        y_text = c_os + 10        
        t_ha='center'
        t_va='bottom'
        
    plt.text(
        x_text, y_text, desc,
        fontdict=dict(
            size=15,
            ha=t_ha,
            va=t_va
        )
    )

for desc, (y0, ye) in select_periods.items():
    c_os = (
        df_year_target
        .query('year_setin >= @y0 and year_setin <= @ye')
        .value_counts('year_setin')
        .max()
    ) + count_offset
    
    plt.plot(
        [y0-0.5,ye+0.5], [c_os, c_os],
        color='#12a4b6', lw=2, 
        marker='.', markersize=10
    )
    
    desc = f'{desc}\n({y0} - {ye})'
    
    plt.text(
        (y0 + ye)/2, c_os + count_offset/2, desc,
        fontdict=dict(
            size=15,
            ha='center',
            va='bottom',
        )
    )

plt.xlabel('Years that movies are set in (restricted to 1800 - 2050, set-in type = "year")')
plt.ylabel('Number of movies')
plt.suptitle('Distribution of years that movies are set in', fontsize=20)

sns.despine(trim=True, offset=10)
plt.tight_layout()

plt.savefig('figures/distrib-years-setin.pdf')
plt.savefig('figures/distrib-years-setin.png')

../_images/0ed9796ff0da240ce31e4af4a27e28a9cd3728d5c0a1bfb96b2f92bd979d0b9c.png

(
    df_year_target
    .query('year_setin in [1944,1945]')
    .value_counts('year_setin')
)

year_setin
1944    195
1945    167
dtype: int64

Many movies tend to be set in war periods, especially World War II (1939 - 1945). Interesting, there are more movies set in 1944 (195 movies) than in 1945 (167 movies). Not sure why.

Additionally, besides war periods, some notable events are annotated in the distribution and have high peaks as well, such as:

1969: Apollo 11 Moon Landing
1999: the year before Y2K. I annotated as such because I suspect this is why there’s a high peak in 1999 instead of 2000.
2012: the supposed end-of-the-world

Plot time difference between set-in and produced#

../_images/b31b0b279441065e18a1dec01b616dcaebb7e2c1d129f434e3a1902e855fd893.png

Considerable number of movies are set in the past, even more so than setting in the present. The tail towards the past is much heavier than it is for the future.