The table below compares some of the automated options. Instructions for each option follow the table.
| Option | Cost | Transcription Formats | File Type Generated | Languages Supported | Ethics Considerations |
|---|---|---|---|---|---|
| Microsoft 365 – Word Transcription | $0 (but limited to 300 minutes per month) | just text; with speakers; with timestamps; with speakers and timestamps | Word document | 80+ languages/dialects | UofT’s Research Ethics Board has approved this approach in the past if you stated that you were keeping all files on OneDrive using multifactor authentication (but of course that depends on your particular situation and what you wrote in your research ethics protocol). Read more information on how the service works under the About Transcribe heading at the bottom of the page. |
| aTrain | $0 | with speakers; with speakers and timestamps | Text file | 57 languages | This is a program you run locally on your computer (with no need for internet access), so it is very likely that the Research Ethics Board would approve this approach. |
| Zoom | $0 | with timestamps; with speakers and timestamps | VTT file; Text file | 49 languages/dialects | Keep in mind that the recording and transcript are stored on Zoom’s servers in Canada (but Zoom's US servers are still used for real-time data processing). This may or may not be acceptable for research ethics. |
| Microsoft 365 – Clipchamp Transcription | $0 | with timestamps | VTT file or Word Document | 80+ languages/dialects | This is similar to using the Microsoft 365 Word solution above. |
| YouTube | $0 | with timestamps | VTT file | 67 languages | Keep in mind that the recording and transcript are stored on YouTube's servers. This may or may not be acceptable for research ethics. |
| NVivo Transcription | $25USD/hour; cheaper bulk purchases available | with speakers and timestamps | Word document; Text file | 43 languages | Keep in mind that the recording and transcript are stored on Lumivero’s servers. This may or may not be acceptable for research ethics. Read more about their data security. |
| MAXQDA Transcription | Different options from $23.80 USD for 2hrs up to $178.50 USD for 20hrs. You can also get 60 minutes free to get started. | with speakers and timestamps | Text file | almost 50 languages | Keep in mind that the recording and transcript are stored on MAXQDA's servers. This may or may not be acceptable for research ethics. Read more about their data security. |
A. Microsoft 365 – Word Transcription
Summary: Microsoft Word 365 online works with both audio and video files and is available to UofT faculty, staff, and students. You can use the transcribe feature to get a Word document transcript with just text, speaker names, timestamps, or speaker names and timestamps.
Instructions:
- Go to Word 365 online.
- Enter your UofT email address. You will then be taken to a page where you log in using your UTORID credentials.
- Create a new blank document.
- Click on the drop-down arrow next to the Dictate icon (i.e., the microphone icon) in the Home ribbon menu, then select Transcribe.
- From the right-side box, pick your language and upload your audio or video file.
- You will see a progress bar as it transcribes. This takes a bit of time but is fairly fast. Wait for it to finish.
- When done, you should see a preview of the transcription.
- Below the transcription, click on the drop-down arrow next to the Add to Document button.
- Select the option you want: just text, with speakers, with timestamps, with speakers and timestamps. It will then add the transcript to the Word document that you can then edit and/or download.
Note: Currently, there is a limit of 300 minutes per month per user. (In the past, Microsoft offered unlimited minutes, but that has unfortunately changed.)
B. aTrain
Summary: This is a free, open-source application you run locally on your computer. It will transcribe both your audio and video files, creating text files with speaker names and timestamps. aTrain combines OpenAI's Whisper transcription models with speaker recognition and provides outputs that integrate with MAXQDA and ATLAS.ti.
Instructions: For Windows, you can download and install it from the Microsoft Store. Or for Windows, macOS, and Linux, you can use the Download aTrain page. Then you just select your audio or video file. Optionally, provide information on what language the file is in and how many speakers there are (although speaker identification works better if you specify this). You can also decide what level of model you'd like to use to run the transcription.
aTrain comes installed with the large-v3-turbo model to start, but you can select Models from the left menu and use the download buttons to download other models (generally, the larger the model, the more accurate the transcription, but at the cost of speed). You can read more about the models in the documentation.
Once your settings are selected, click on Start. You should see a progress screen. A message will tell you when it is done. You can click on Open to open up a folder with your transcripts created in various formats. If the Open button does not work, you should be able to manually browse to this folder by going to your Documents\aTrain\transcriptions folder. The transcription.txt file just has speaker labels. The transcription_timestamps.txt just has timestamps. The transcription_maxqda.txt has both speaker labels and timestamps.
aTrain also provides a paper with more details on its use.
C. Zoom
Summary: Zoom works with both audio and video files and is available to UofT faculty, staff, and students. You can also use Zoom to capture your audio and video recordings. You generally have two options:
- Create transcripts live
- Create transcripts from cloud recordings (Note, though, that undergraduate students can’t record to cloud)
Instructions:
- First decide if you are going to use Zoom to also capture your audio or video recordings of an interview, focus group, etc. If so, start your Zoom meeting
- If instead you have an audio or video file you captured from somewhere else, use Zoom to host a meeting for one, share your screen, including system audio, and be ready to play your video or audio files (as if presenting a webinar)
- In either case, once you are ready to proceed, follow the steps below for Option 1 or 2
Option 1: Live Captioning
- Turn on live captioning by going to the drop-down arrow next to the Show Captions option in the Zoom menu, and selecting the language for your captions
- You can then show or hide captions by toggling that option in the Zoom menu. Toggle the captions so that they are hidden (captioning still happens in the background)
- Zoom will auto caption it live. You can download the transcript when it is done, by selecting View Full Transcript from the drop-down menu for Show Captions
- Click on the Save transcript button at the bottom of the transcript window. This results in a text file you can download
- If you were the only person in the meeting, it will tag all the text with your Zoom name. You would have to edit it if you wanted to label appropriate speaker names. If you were conducting your interview live, then it should automatically label appropriate speaker names
Option 2: Cloud Recording Transcription
- Record your meeting to the cloud by clicking on the More option in the Zoom menu and selecting Record -> Record to the cloud
- Conduct your meeting as normal and then leave the meeting when you are done. You will get notified when the recording is ready
- Go to the web view of UofT’s Zoom and log in with your account
- Select Recordings & Transcripts from the left menu
- Select the title of the meeting you just recorded. This screen will tell you if the recording is still being transcribed or if it is done. If it says unable to transcribe, make sure to select your language by hovering over audio transcript
- When it is ready, you can select Audio transcript to download a VTT file with timestamps and speaker labels.
- If you were the only person in the meeting, it will tag all the text with your Zoom name. You would have to edit it if you wanted to label appropriate speaker names. If you were conducting your interview live, then it should automatically label appropriate speaker names
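If Zoom has labelled the speakers correctly and you would rather use role labels or pseudonyms than Zoom display names, a short script can swap the labels for you. The Python sketch below is only an illustration. It assumes each caption line starts with the speaker's Zoom name followed by a colon and a space (check your own file to confirm this), and the function name `relabel_speakers` and the example names are made up for this guide. It cannot help when every line carries the same name (the meeting-for-one case above), because it has no way of knowing who actually spoke.

```python
def relabel_speakers(transcript_text: str, names: dict[str, str]) -> str:
    """Swap speaker labels at the start of caption lines.

    Assumption: caption lines look like "Zoom Name: spoken text".
    Timestamp lines contain colons but no colon followed by a space,
    so they pass through unchanged.
    """
    relabelled = []
    for line in transcript_text.splitlines():
        label, separator, spoken = line.partition(": ")
        if separator and label in names:
            line = f"{names[label]}: {spoken}"
        relabelled.append(line)
    return "\n".join(relabelled)


# Hypothetical example names; replace with the labels in your own file.
example = (
    "WEBVTT\n\n"
    "1\n00:00:01.000 --> 00:00:04.000\nKelly Smith: Thanks for joining.\n\n"
    "2\n00:00:04.500 --> 00:00:06.000\nSam Lee: Happy to be here."
)
print(relabel_speakers(example, {"Kelly Smith": "Interviewer", "Sam Lee": "Participant 1"}))
```

Note that this only changes the labels at the start of each line. Names mentioned inside the spoken text are left as they are, so you would still need to review the transcript if you are de-identifying it.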
D. Microsoft 365 - Clipchamp Transcription
Summary: Clipchamp is an option available to UofT faculty, staff, and students that works with video files only to create VTT files (captions with text and timestamps). Generally, VTT files are not transcripts, but can help you speed up the process of creating transcripts, in some cases.
Instructions:
- Go to Clipchamp 365 online.
- Enter your UofT email address. You will then be taken to a page where you log in using your UTORID credentials.
- Once logged in, either select your video from the content list of videos in your OneDrive OR you can click on Upload on the left, next to the Screen recording button to upload a new file. Browse to your video file and select it to upload.
- Once uploaded and showing up in the content list, click on it to play in Clipchamp.
- Click on Video settings on the Right.
- Expand the Transcript and captions section by clicking on its drop-down arrow.
- Click on Generate and select the language to generate captions.
- Once finished, you should see the captions listed in that section, showing the language and, below it, “Generated by Microsoft”. Close the Video settings.
- If you click on the Transcript option that is now available on the right, you can view the video and transcript side-by-side if you want to make any edits.
- When done, go back to Video Settings, to the Transcript and captions section. Next to the generated caption listed, click on the … icon for those captions, and select Download to download as a .docx or .vtt file.
But note that there won’t be any speakers’ names in the file; you would have to add those manually if you wanted them. Because you’ll have to add speaker names and timestamp divisions might not occur at a change in speaker, this could be a labour-intensive process to augment and clean up, depending on your file (for example, there would be a lot more cleanup required for a focus group transcript). Also, Clipchamp seems to generate more timestamp divisions than other tools.
Also, as mentioned, this only works for video files – see the Notes on Converting Audio Only Files to Videos section, if needed.
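If the many short timestamp divisions get in the way, one option is to merge the captions into longer passages with a timestamp only every so often. The Python sketch below is an illustration rather than a finished tool. It assumes a standard VTT layout (captions separated by blank lines, each with a `start --> end` timing line followed by the caption text), and the function names and the 60-second default are choices made for this example. You would still add speaker names by hand.

```python
import re

# Captures the start time of a cue, without milliseconds.
# Hours are optional in VTT files, so both "00:01:05.000" and "01:05.000" match.
TIMING = re.compile(r"^((?:\d{2,}:)?\d{2}:\d{2})\.\d{3} --> ")


def to_seconds(stamp: str) -> int:
    """Convert "HH:MM:SS" or "MM:SS" to a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def merge_cues(vtt_text: str, interval: int = 60) -> str:
    """Join caption text into passages that each start with one timestamp."""
    passages = []       # list of (start stamp, [caption texts])
    passage_start = None
    for block in re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n")):
        lines = [ln.strip() for ln in block.strip().splitlines()]
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                break
        else:
            continue    # header or note block with no timing line
        text = " ".join(ln for ln in lines[i + 1:] if ln)
        if not text:
            continue
        start = to_seconds(match.group(1))
        # Start a new passage once `interval` seconds have passed.
        if passage_start is None or start - passage_start >= interval:
            passages.append((match.group(1), []))
            passage_start = start
        passages[-1][1].append(text)
    return "\n\n".join(f"[{stamp}] {' '.join(texts)}" for stamp, texts in passages)
```

To use it, read your downloaded .vtt file into a string, pass it to `merge_cues`, and save the result as a text file. A smaller `interval` keeps more timestamps; a larger one gives longer passages.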
E. YouTube
Summary: YouTube is a free option that works with videos only to create VTT files (captions with text and timestamps). Generally, VTT files are not transcripts, but can help you speed up the process of creating transcripts, in some cases. If the video is not private or sensitive (and so you are comfortable uploading it to YouTube and it complies with your research ethics approval, if applicable), you could use YouTube’s free auto transcription service to create a VTT file.
You don't need to share the video publicly because you can upload it as a private file to your account. After uploading the video, you can use the caption service to generate a VTT file, which you can then correct and download.
But note that there won’t be any speakers’ names in the file; you would have to add those manually if you wanted them. Because you’ll have to add speaker names and timestamp divisions might not occur at a change in speaker, this could be a labour-intensive process to augment and clean up, depending on your file (for example, there would be a lot more cleanup required for a focus group transcript).
Also, as mentioned, this only works for video files – see the Notes on Converting Audio Only Files to Videos section, if needed.
Instructions:
- Follow the upload your video instructions (or the instructions if your video is longer than 15 minutes)
- Then follow the instructions to use the caption service and generate a VTT file
- Then review and correct the VTT file captions
- Finally follow the instructions to download the corrected VTT file
F. NVivo Transcription
Summary: Lumivero offers a paid automated transcription service where you upload audio and video files to transcribe. You are able to edit the files in the online interface and download the transcripts as text or Word files when they are ready.
Instructions:
- First sign up for and purchase the NVivo transcription service
- Then follow Lumivero’s step-by-step instructions and how-to video for more information
G. MAXQDA Transcription
Summary: MAXQDA offers a paid automated transcription service where you upload audio and video files to transcribe and then receive text transcripts, which you can then edit from within MAXQDA.
Instructions: See the bottom of MAXQDA's transcription page for step-by-step instructions for two methods they offer for transcribing with their service.
Notes on Converting Audio Only Files to Videos
YouTube and Clipchamp work with video files only. If you have an audio-only file, you will have to turn it into a simple video first (that is, add a still image and save it as an MP4 file). One way to do that is to create one slide in PowerPoint, add the audio, and use PowerPoint to export the slideshow as an MP4 video file. Note: Normally the audio quality is reduced with this method, which can affect the quality of the transcription.
Windows Instructions:
- Open up PowerPoint and create a new blank presentation
- Select Insert -> Audio -> Audio on My PC… and select your audio file. You should see a new audio icon appear in the middle of the slide. You can click on the play button to hear the audio
- With that audio icon selected, you should see a Playback ribbon menu at the top appear. From the drop-down menu next to Start (in the middle), select Automatically
- Select File-> Export -> Create a Video
- Decide on your file quality (I would go with Full HD or better as lowering the video quality will also affect the audio quality, and you would like good audio quality for transcription)
- Select Don’t Use Recorded Timings and Narrations and set the seconds spent on each slide to 0
- Click on Create Video. You will be prompted to name the file and decide where to save it (keep the default to save it as a MPEG-4 Video). This may take some time if you have a long audio recording
Now you have a video file (built from your audio file) that you can use for transcription.
For Mac, the instructions are similar.
Cleaning VTT files
If you want an automated way to strip out timestamps and numbering in Zoom transcript files, you can follow these instructions using REGEX and a text editor, such as Notepad++. (Generally, REGEX is a powerful way to identify patterns in text and could be used in a variety of ways to clean up transcripts.)
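If you are comfortable running a short script, the same kind of pattern matching can be done outside a text editor. The Python sketch below is one possible approach, not the method from the linked instructions. It assumes a typical Zoom VTT layout (a `WEBVTT` header, then for each caption a number, a `start --> end` timing line, and the caption text), and the function name `strip_vtt` is made up for this example.

```python
import re

# A timing line such as "00:00:01.000 --> 00:00:04.000" (hours are optional).
TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3} --> (?:\d{2,}:)?\d{2}:\d{2}\.\d{3}.*$"
)
# A line that holds only a caption number, such as "12".
CUE_NUMBER = re.compile(r"^\d+$")


def strip_vtt(vtt_text: str) -> str:
    """Remove the header, caption numbers, timing lines, and blank lines."""
    lines = [ln.strip() for ln in vtt_text.lstrip("\ufeff").splitlines()]
    kept = []
    for i, line in enumerate(lines):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if not line:
            continue
        if i == 0 and line.startswith("WEBVTT"):
            continue
        if TIMING.match(line):
            continue
        # Drop a number only when a timing line follows it, so that a
        # spoken line consisting of digits (e.g. "42") is kept.
        if CUE_NUMBER.match(line) and TIMING.match(next_line):
            continue
        kept.append(line)
    return "\n".join(kept)
```

What remains is one line per caption, including any speaker labels, which you can then paste into a document and tidy up further.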
Advice on Workflows for using VTT files in NVivo and MAXQDA
The article “Auto-Creating, Correcting and Coding Transcripts from Microsoft Teams or Zoom in CAQDAS Software (ATLAS.ti, NVivo or MAXQDA)” discusses the general process of creating your own transcripts, cleaning up VTT files, and then bringing in those files along with your audio/video files into NVivo or MAXQDA to work with them there.
Other Resources on Captions/Transcripts/De-Identification
- Centre for Teaching Support & Innovation Guide to Captioning Videos
- De-Identification: Depending upon the research, there are situations where you may be sharing your data and need to protect your human participants' identities, so that someone viewing the data couldn't figure out who the data is associated with. These resources can provide more information and tips, if you need to de-identify your transcripts:
Also, visit our Getting Started pages for more information, tutorials, and workshops for NVivo or MAXQDA!