Webucator Blog

Simple Python Script for Extracting Text from an SRT File

Watching movies or TV shows in a foreign language is a great way to learn that language, but it can be challenging. Fast speech, slang, and background noise can all make the dialogue hard to follow. I find it helpful to have subtitles that match the speech, but foreign-language films and shows don’t always come with them. Fortunately, you can often find subtitle files (with a .srt extension) at opensubtitles.org. Unfortunately, those files aren’t easy to read, because they are marked up with cue numbers and timestamps and include a description of every sound made (e.g., a mobile phone ringing).

For example, we are currently watching El Ministerio del Tiempo on Movistar in Spain. The SRT file of season 1, episode 3 (available here) begins like this:

1
00:00:33,599 --> 00:00:35,270
(NARRA) "Soy Amelia Folch.

2
00:00:36,199 --> 00:00:39,870
Tengo 23 años y sin embargo
he salvado la vida del Empecinado.

3
00:00:45,160 --> 00:00:46,550
(Disparo)

4
00:00:48,800 --> 00:00:50,310
He conocido a Lope de Vega.

5
00:00:56,400 --> 00:00:58,080
Y he visto la Armada Invencible.

I wanted to reduce that to:

(NARRA) "Soy Amelia Folch.
Tengo 23 años y sin embargo he salvado la vida del Empecinado.
He conocido a Lope de Vega.
Y he visto la Armada Invencible.

Here’s the solution I came up with.

Run it like this, passing the name of the SRT file followed by its character encoding (cp1252 in this example):

python srt_to_txt.py file_name.srt cp1252

Note that the script assumes that a line beginning with a lowercase letter or a comma continues the previous line and that a line beginning with any other character starts a new one. This won’t always be correct (a sentence can wrap just before a capitalized name, for example), but it does a good enough job to make it easy to follow along with the movie.
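To make that heuristic concrete, here is a minimal sketch of the same idea. It is not necessarily the script described above. The function names srt_to_text and convert_file are illustrative, and the sketch assumes that a cue number is a line made up only of digits, that a timing line contains "-->", and that a line consisting entirely of one parenthesized phrase, such as (Disparo), is a sound description to drop.

```python
import re
from pathlib import Path

# Assumption: a cue number is a line containing only digits.
INDEX_RE = re.compile(r"^\d+$")
# Assumption: a line that is nothing but "(...)" describes a sound, not speech.
SOUND_RE = re.compile(r"^\([^)]*\)$")


def srt_to_text(srt: str) -> str:
    """Strip cue numbers, timing lines, and sound cues from SRT content."""
    out = []
    for raw in srt.splitlines():
        line = raw.strip()
        # Skip blank lines, cue numbers, timing lines, and sound descriptions.
        if not line or INDEX_RE.match(line) or "-->" in line or SOUND_RE.match(line):
            continue
        # Heuristic from the post: a lowercase letter or a comma at the
        # start means this line continues the previous one.
        if out and (line[0].islower() or line[0] == ","):
            separator = "" if line[0] == "," else " "
            out[-1] += separator + line
        else:
            out.append(line)
    return "\n".join(out)


def convert_file(path: str, encoding: str = "utf-8") -> str:
    """Read an SRT file with the given encoding and return its plain text."""
    return srt_to_text(Path(path).read_text(encoding=encoding))
```

Calling convert_file("file_name.srt", "cp1252") mirrors the command line above. One known limitation of this sketch: a line of dialogue that consists only of digits would be mistaken for a cue number and dropped.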

