Extracting Hardsubs

This is an addendum to the Extracting Hardsubs guide from Zalis. You might want to read that guide first if you haven't already.

Method Three: OCRing using VideoSubFinder

Required Tools: 

- VideoSubFinder
- FineReader (I use FineReader 11, though older versions might be suitable too). Don't ask me for it - find it on the torrent sites, buy it from Abbyy, or whatever.
- Aegisub

Advantages:

- Does not require much user interaction until the error checking in Aegisub.
- Has no problems with multicolor subs and even creates different ASS styles trying to mimic the original colors.
- Finds subtitles in any part of the screen, be it top, bottom or middle.
- Using custom patterns in FineReader, can be trained to recognize fancy/non-standard fonts with acceptable quality.

Disadvantages/caveats:

- The program is very slow and sometimes crashes.
- Some lines might be missed. This does not happen often and is easy to catch during the scenetiming/QC stage.
- When two lines are very similar in shape and appear one after another without flickering, VSF sometimes doesn't recognize that those are different lines, OCRs the first line and extends its time to the end of the second line. This is the error I hate the most because you can't see it during error checking in Aegisub and you can even miss it during the later scenetiming/QC stage if you are not watchful enough.
Example: Yakitate Japan Ep7 ~06:18: two short lines "This taste..."/"This filling..." were recognized as one line "This taste..." which spans through both phrases.
The good thing is that this does not happen often, maybe once per 10 episodes on average.
(BTW if you transcribe the subs using Russian subs as a template, you should also be wary of this situation. A lot of Russian subs were created from English hardsubs using VSF and such errors were sometimes left unnoticed; I encountered this several times.)


Part 1. Recognition.

1. Start VideoSubFinder. Normally I go to the Settings tab and uncheck "Using fast version (partially reduced)", though I'm not sure it really makes any difference. I leave all other settings at default values.
2. Load the hardsub using menu item File -> Open Video All Default.
3. Press "Clear Folders" on the Search tab. Don't forget to do this after loading any new video file.
4. Press "Run Search" on the Search tab. This process is rather long (especially if you unchecked "Using fast version"), so you can go do something else until it completes.
5. The original Russian guide suggests looking through the images in the RGBImages folder and deleting false positives, i.e. images with no subtitles. Usually I'm too lazy to do this. But at the very least, you should delete OP/ED karaoke. It won't be recognized well anyway.
6. After that, go to the OCR tab, press "Create Cleared TXT Images" and go do something else until it completes.
7. Start FineReader.
8. Open all images from the TXTImages folder. (In FineReader 11, press Open on the toolbar, navigate to TXTImages and select all files using Ctrl+A.)
9. Press "Read" to OCR the images.
10. Press "Save" and save the recognized results as "Text (*.txt)" into the TXTResults folder. Use the "Name files as source images" option.
11. Return to VideoSubFinder and press the "Create Sub From TXT Results" button. The subtitles will be saved to the "sub.ass" file in the main VideoSubFinder folder.
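
Before pressing "Create Sub From TXT Results", it can help to check that FineReader really produced one text file per image. The following is a minimal sketch, not part of VideoSubFinder or FineReader: the function name missing_results is made up for illustration, and it assumes only what step 10 sets up, namely that each result file has the same base name as its source image plus a .txt extension.

```python
from pathlib import Path


def missing_results(images_dir, results_dir):
    """Return base names of images that have no same-named .txt result."""
    # Base names (without extension) of every image handed to the OCR step.
    image_stems = {p.stem for p in Path(images_dir).iterdir() if p.is_file()}
    # Base names of every text file the OCR step saved.
    result_stems = {p.stem for p in Path(results_dir).glob("*.txt")}
    return sorted(image_stems - result_stems)


# Hypothetical usage, run from the main VideoSubFinder folder:
#   for stem in missing_results("TXTImages", "TXTResults"):
#       print("no OCR result for", stem)
```

An empty list means every image has a matching result. Anything it reports is an image worth looking at by hand before building the subtitle file.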

Part 2. Fixing OCR errors.

1. One annoying "feature" of VideoSubFinder is that it sometimes recognizes multiline subtitles as multiple separate lines appearing at the same time and explicitly positions one above the other using \pos in the .ass file. The result is visually the same, but only until you change margins/font size/etc. To be able to fix this faster, before opening the .ass file in Aegisub, I normally open it in a regex-capable text editor and delete all ASS tags by replacing "\{.*?\}" with an empty string.
2. After that, open the .ass file in Aegisub, select all lines and set vertical margin to 90.
3. Load the original hardsubbed video file (Video -> Open Video).
4. Go line-by-line, compare the recognized text with the original text and fix any errors you see. When multiple lines were recognized as separate lines, select them and use "Join (concatenate)", preceded by "Swap" or Shift-Up/Down if they are swapped. Delete any unneeded garbage lines using Ctrl-Del.

This process should take about 10 minutes per episode. If it takes much longer, the OCR is not working well for this show and you might want to use the transcription method described in the guide from Zalis. Or you might try to improve recognition quality by creating a custom pattern.
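
If you would rather script the tag removal from step 1 than do it in a text editor, the same regex works in Python. This is only a sketch: the function names are made up for illustration, and the file encoding is an assumption (I have not checked what VideoSubFinder writes), so adjust it if the text comes out garbled.

```python
import re

# Same pattern as in step 1: a "{", then as few characters as possible, then "}".
TAG_BLOCK = re.compile(r"\{.*?\}")


def strip_ass_tags(ass_text):
    """Delete every {...} override block from the given ASS text."""
    return TAG_BLOCK.sub("", ass_text)


def strip_ass_file(src_path, dst_path, encoding="utf-8"):
    """Read an .ass file, strip the override blocks, write the result.

    The default encoding is an assumption, not something VSF guarantees.
    """
    with open(src_path, encoding=encoding) as src:
        cleaned = strip_ass_tags(src.read())
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.write(cleaned)


# Hypothetical usage:
#   strip_ass_file("sub.ass", "sub_clean.ass")
```

Like the editor replacement, this removes every override block in the file, including the \pos tags and any color tags, so run it only on the raw VSF output and not on a file you have already typeset.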

Part X. Creating a custom pattern in FineReader

FineReader can be trained to recognize a certain font better. These instructions are based on version 11 of FineReader.
1. Create a new pattern using Tools/Pattern Editor/New.
2. Go to Tools/Options and switch on "Read with training" on the Read tab. Also make sure "Thorough reading" and "Use built-in and user patterns" are selected.
3. Select one of the pages with subs in the left list and press Page/Read Page (Ctrl-R). FineReader will ask you to input any uncertain characters, similar to what SubRip does. If you accidentally type the wrong character, you can go to the Pattern Editor and delete it.
4. Repeat the previous step until you are satisfied with the recognition result. After that, switch off "Read with training" and press the "Read" button on the toolbar to OCR the remaining pages.
5. When you want to OCR another episode of the same show later, make sure to set your new custom pattern as Active in the Pattern Editor.

Part 3. Remaining work (scenetiming/typesetting/adding missing lines/QC/insert songs/etc.)

The subtitles created using the previous steps should be quite watchable. However, to create really good subs, some additional work is needed.
1. Load the resulting subtitle file with the final (DVD) video in Aegisub.
2. Shift the times as needed for DVD.
3. Run Aegisub Timing Post-Processor. My normal settings are "Add lead-out 300ms" (depending on the original timings, maybe not needed); "Make adjacent subtitles continuous" with threshold 430ms and all bias towards the End; "Keyframe snapping" 5/4/10/8. Your mileage may vary.
4. Then I watch the final video from the beginning to the end inside Aegisub, having the original hardsub ready in another window for reference. I pause the playback as needed and change the start/end times of the individual lines so that there is no flickering, no scene-bleeding, the line doesn't appear too early or too late and is not too short. Also add any missing lines, fix remaining OCR errors and typos, get rid of 3-liners, detect and typeset the signs along the way. Normally I do all of that in one pass, because doing more than one pass per episode would be too boring.

Also, at this stage I may change the timings to reflect my personal view on lead-in/lead-out, which derives from the pattern I use when watching something with subtitles: FIRST look at the video and hear the audio, THEN move the eyes down to read the subs. Thus, subtitles should appear at the same time or even a bit later than the audio starts, and disappear later than the audio ends. This rule can be bent depending on scene changes and other circumstances. Unlike some groups, I do not have any specific numbers of milliseconds in mind and just use a subjective approach.

This pass (let's call it just "scenetiming" for brevity) can take 1 hour or more per episode, depending on the amount of signs and the quality of the original timings. It is, however, mostly optional.
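
To make the lead-out and "continuous" settings from step 3 more concrete, here is a rough sketch of what those two passes do to the timings. It is not Aegisub's actual code: the Line class and post_process function are made up for illustration, keyframe snapping is left out, clamping the lead-out at the next line's start is my own simplification, and reading "all bias towards the End" as bias=1.0 (the earlier line is extended up to the start of the next one) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Line:
    start: int  # milliseconds
    end: int    # milliseconds


def post_process(lines, lead_out=300, gap_threshold=430, bias=1.0):
    """Add a lead-out to each line, then close small gaps between neighbours."""
    ordered = sorted(lines, key=lambda line: line.start)

    # Pass 1: extend every end time by the lead-out, without running into
    # the next line when the two did not overlap to begin with.
    out = []
    for i, line in enumerate(ordered):
        new_end = line.end + lead_out
        if i + 1 < len(ordered) and line.end <= ordered[i + 1].start:
            new_end = min(new_end, ordered[i + 1].start)
        out.append(Line(line.start, new_end))

    # Pass 2: when the gap to the next line is small enough, make both lines
    # meet at one point. bias=0.0 puts that point at the earlier line's end,
    # bias=1.0 at the later line's start.
    for prev, cur in zip(out, out[1:]):
        gap = cur.start - prev.end
        if 0 < gap <= gap_threshold:
            meet = prev.end + round(gap * bias)
            prev.end = meet
            cur.start = meet
    return out
```

For example, lines at 1000-2000, 2500-3500 and 5000-6000 ms come out as 1000-2500, 2500-3800 and 5000-6300: the 200 ms gap left after the first lead-out is closed, while the 1200 ms gap before the last line is left alone.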


Recommended reading

Extracting Hardsubs guide from Zalis
Timing guide from m.3.3.w - I may not agree with everything they say about styling, but their seven Timing Rules are a must.
Styling guide from Underwater
Typesetting in Aegisub guide

12 comments:

  1. I'd like to add something to your guide: you can use the 4 corners of the VideoSubFinder window to specify the area of the window you want it to look at.

  2. Where would you create the new tools? I have FineReader 11 Corporate Edition and I don't see "Tools/Pattern Editor/New".

  3. never mind, I found it. Thanks for the tutorial! ^_^

  4. I have a problem. When I press "Create Sub From TXT Results" I get an error... What can I do?

    Replies
    1. I also had the same problem. Before the "Create Cleared TXT Images" step, go through the FRDImages folder and delete any images that might be hard for ABBYY to decipher (e.g. vertical text, non-Romanized text, angled text). While you are doing this, make sure to delete the counterpart image in the RGBImages folder. Do the rest of the steps as usual.

      On another note, I prefer editing with Subtitle Edit, but the VSF subs leave a huge gap between the first letter of a line and the rest of the line (it's not noticeable in Aegisub or mediaplayers...it just bothers me). To fix this, open the .srt subs from VSF in Aegisub. Save as a .ass file. Should look normal in Subtitle Edit now :)

  5. HI. thanks for giving out this guide.

    I tried all the steps. I have now finished running searching.

    However, "Create Cleared TXT Images" always crashes the program. It runs for only ~2-3 minutes before crashing.

    There must be something which causes the crash and I couldn't figure it out.

  6. Yes. Sometimes it crashes when it tries to clear a certain image. In this case you just have to identify which image causes the crash and delete it.

    When you press "Create Cleared TXT Images", it takes the images from the "RGBImages" subfolder, cleans them and copies the result into the "TXTImages" subfolder. So go to "TXTImages" and check which image is the last one there (when sorted alphabetically). This is the last image which was processed successfully. The next one caused the crash. Note the file name of that last (good) image in "TXTImages", then go to "RGBImages", find the file with the same name and delete the NEXT file (sorted alphabetically), which is the one that causes the crash. Then press "Create Cleared TXT Images" again.

    Also you might want to look at the image before deleting it and note the timestamp (it's part of the file name), because if the image contains a subtitle line, this line will be missing from the resulting .ass file and you'll have to insert it manually later.
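
If the folders are large, the "find the next file" part can be scripted. This is only a sketch of the manual procedure above: the function name next_unprocessed is made up for illustration, it assumes the cleaned images keep the same file names as the source images, and Python's plain string sort may not order names exactly like your file manager does, so check the result by eye before deleting anything.

```python
from pathlib import Path


def next_unprocessed(source_dir, cleaned_dir):
    """Return the name of the first source image after the last cleaned one.

    Returns None when there is no such image.
    """
    sources = sorted(p.name for p in Path(source_dir).iterdir() if p.is_file())
    cleaned = sorted(p.name for p in Path(cleaned_dir).iterdir() if p.is_file())
    if not cleaned:
        # Nothing was cleaned at all, so the very first image is the suspect.
        return sources[0] if sources else None
    last_good = cleaned[-1]
    # The suspect is the first source image sorted after the last good one.
    later = [name for name in sources if name > last_good]
    return later[0] if later else None


# Hypothetical usage: pass the folder of source images and "TXTImages".
#   print(next_unprocessed(source_folder, "TXTImages"))
```

It only reports the file name and deletes nothing, so you can still open the image and note its timestamp first.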

  9. How can I improve the TXTImages? Some characters are wrong, incomplete, or missing some parts. davidnx@189.cn
