Speech recognition software for Linux
Several speech recognition (SR) software packages have been available for Linux since the early 2000s. Some of them are free and open-source software and others are proprietary. Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for communicating operational commands to a computer.
Linux native speech recognition
History
In the late 1990s, a Linux version of ViaVoice, created by IBM, was made available to users for no charge. In 2002, the free software development kit (SDK) was removed by the developer.
Development status
In the early 2000s, there was a push to develop a high-quality native speech recognition engine for Linux. As a result, several projects dedicated to creating Linux speech recognition programs were begun, such as Mycroft, which is similar to Microsoft Cortana but open source.
Speech sample crowdsourcing
It is essential to compile a speech corpus to produce acoustic models for speech recognition projects. VoxForge is a free speech corpus and acoustic model repository that was built with the aim of collecting transcribed speech to be used in speech recognition projects. VoxForge accepts crowdsourced speech samples and corrections of recognized speech sequences. It is licensed under a GNU General Public License (GPL).
Speech recognition concept
The first step is to record an audio stream on the computer. The user then has two main processing options:
- Discrete speech recognition (DSR) – processes the audio entirely on the local machine; all aspects of SR are performed within the user's computer. This approach matters for protecting intellectual property (IP) and avoiding unwanted surveillance.
- Remote or server-based SR – transmits the recorded audio to a remote server, which converts it into text and returns the result. Because the data passes through third-party infrastructure, this method is more exposed to surveillance, theft of information, and malware.
Remote recognition was formerly used by smartphones because they lacked the processing power, working memory, or storage to perform speech recognition on the phone itself. Those limits have largely been overcome, although server-based SR on mobile devices remains widespread.
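The difference between the two options can be illustrated with a minimal Python sketch. It assumes the third-party SpeechRecognition package (with PocketSphinx installed for the offline path) and a hypothetical recording named sample.wav; the same audio is transcribed once on the local machine and once by a remote web service.

    # Minimal sketch: the same audio handled locally (PocketSphinx) or remotely
    # (a third-party web API), assuming the "SpeechRecognition" Python package
    # and a hypothetical WAV recording named "sample.wav".
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("sample.wav") as source:
        audio = recognizer.record(source)  # read the whole file into memory

    # Local (DSR-style): PocketSphinx runs on-device, nothing leaves the machine.
    try:
        print("Local:", recognizer.recognize_sphinx(audio))
    except sr.UnknownValueError:
        print("Local engine could not understand the audio")

    # Remote: the audio is uploaded to a third-party server for transcription.
    try:
        print("Remote:", recognizer.recognize_google(audio))
    except sr.RequestError as err:
        print("Remote service unavailable:", err)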
Speech recognition in browser
Discrete speech recognition can be performed within a web browser and works well in browsers that support it. Remote SR does not require installing software on the desktop computer or mobile device, since recognition runs on a server, with the inherent security issues noted above.
- Remote: the dictation service records an audio track of the user via the web browser and processes it on a server.
- DSR: some solutions run entirely on the client, without sending data to servers.
Free speech recognition engines
The following projects are dedicated to implementing speech recognition natively on Linux. They are not end-user applications; they are programming libraries and toolkits that may be used to develop end-user applications.
- CMU Sphinx is a general term to describe a group of speech recognition systems developed at Carnegie Mellon University.
- HTK was the best-known and most widely used speech recognition toolkit before Kaldi.
- Julius is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers.
- Kaldi is a toolkit for speech recognition provided under the Apache license.
- Mozilla DeepSpeech is an open-source speech-to-text engine based on Baidu's Deep Speech research paper.[1]
Possibly active projects:
- Parlatype, an audio player for manual speech transcription on the GNOME desktop, has provided continuous speech recognition with CMU Sphinx since version 1.6.[2]
- Lera (Large Vocabulary Speech Recognition) is a KDE project based on Simon and CMU Sphinx.[3]
- Speech[4] uses Google's speech recognition engine to support dictation in many different languages.
- Speech Control is a Qt-based application that uses CMU Sphinx tools such as SphinxTrain and PocketSphinx to provide speech recognition utilities such as desktop control, dictation and transcription on the Linux desktop.
- Platypus[5] is an open-source shim that allows the proprietary Dragon NaturallySpeaking, running under Wine, to work with any Linux X11 application.
- FreeSpeech,[6] from the developer of Platypus, is a free and open source cross-platform desktop application for GTK that uses CMU Sphinx's tools to provide voice dictation, language learning, and editing in the style of Dragon NaturallySpeaking.
- Vedics[7] (Voice Enabled Desktop Interaction and Control System) is a speech assistant for the GNOME environment.
- NatI[8] is a multi-language voice control system written in Python.
- SphinxKeys[9] allows the user to type keyboard keys and mouse clicks by speaking into their microphone.
- VoxForge is a free speech corpus and acoustic model repository for open source speech recognition engines.
- Simon[10] aims to be extremely flexible in order to compensate for dialects or even speech impairments. It uses either HTK with Julius or CMU Sphinx, runs on Windows and Linux, and supports training.
- Jasper[11] is an open-source platform for developing always-on, voice-controlled applications; it is an embedded Raspberry Pi front end for CMU Sphinx or Julius.
Developers can create Linux speech recognition software by building on these existing open-source packages, as in the sketch below.
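For illustration, the following minimal sketch uses Mozilla DeepSpeech's Python bindings to transcribe a WAV file. The model and audio filenames are placeholders, and the audio is assumed to be 16-bit, 16 kHz mono, which the released English models expect.

    # Sketch of using the Mozilla DeepSpeech Python bindings (pip install deepspeech).
    # Model and audio filenames are placeholders; released models expect
    # 16-bit, 16 kHz mono PCM audio.
    import wave

    import numpy as np
    from deepspeech import Model

    model = Model("deepspeech-english.pbmm")             # placeholder model file
    # model.enableExternalScorer("english.scorer")       # optional language model

    with wave.open("dictation.wav", "rb") as wav:        # hypothetical recording
        frames = wav.readframes(wav.getnframes())

    audio = np.frombuffer(frames, dtype=np.int16)        # raw PCM samples as int16
    print(model.stt(audio))                              # print the transcript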
Inactive projects:
- CVoiceControl[12] is a KDE and X Window independent version of its predecessor, KVoiceControl. The author ceased development while the project was still in alpha.
- Open Mind Speech,[13] part of the Open Mind Initiative,[14] aimed to develop free (GPL) speech recognition tools and applications and to collect speech data. Development ended in 2000.
- PerlBox[15] is a Perl-based voice control and speech output tool. Development ended at an early stage in 2004.
- Xvoice[16] is a user application that provides dictation and command control to any X application; it requires the proprietary ViaVoice engine to function. Development ended in 2009 during early project testing.
Proprietary speech recognition engines
- Janus Recognition Toolkit (JRTk)[17] is a closed-source speech recognition toolkit, mainly targeted at Linux, developed by the Interactive Systems Laboratories at Carnegie Mellon University and the Karlsruhe Institute of Technology. Commercial and research licenses are available.
Voice control and keyboard shortcuts
Speech recognition usually refers to software that attempts to distinguish thousands of words in a human language. Voice control may refer to software used for sending operational commands to a computer or appliance; it typically requires a much smaller vocabulary and is therefore much easier to implement.
Simple software combined with keyboard shortcuts has the most immediate potential for practical, accurate voice control on Linux, as the sketch below illustrates.
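One possible approach, sketched below under stated assumptions, is to spot a handful of keywords with PocketSphinx (via the third-party SpeechRecognition package, with PyAudio for microphone input) and translate each recognized command into a keystroke using the xdotool utility on X11. The command names and key bindings are illustrative and are not taken from any of the projects listed above.

    # Sketch of small-vocabulary voice control: spot a few keywords with
    # PocketSphinx (via the SpeechRecognition package) and forward each
    # recognized command to the desktop as a keyboard shortcut via xdotool.
    # Command names and key bindings below are illustrative placeholders.
    import subprocess

    import speech_recognition as sr

    COMMANDS = {
        "copy": "ctrl+c",
        "paste": "ctrl+v",
        "new tab": "ctrl+t",
    }

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:                # requires PyAudio
        recognizer.adjust_for_ambient_noise(source)
        while True:
            audio = recognizer.listen(source)
            try:
                heard = recognizer.recognize_sphinx(
                    audio,
                    keyword_entries=[(phrase, 1.0) for phrase in COMMANDS],
                )
            except sr.UnknownValueError:
                continue  # nothing in the small vocabulary was spotted
            for phrase, keys in COMMANDS.items():
                if phrase in heard:
                    subprocess.run(["xdotool", "key", keys])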
Running Windows speech recognition software with Linux
Via compatibility layer
It is possible to use programs such as Dragon NaturallySpeaking on Linux by using Wine, though some problems may arise depending on which version is used.[18]
Via virtualized Windows
Windows speech recognition software can also be run under Linux in a virtual machine. Using no-cost virtualization software such as VMware Server or VirtualBox, it is possible to run Windows and NaturallySpeaking under Linux; both support copy and paste to and from the virtual machine, making dictated text easy to transfer.
References
- ^ "A TensorFlow implementation of Baidu's DeepSpeech architecture". Mozilla. 2017-12-05. Retrieved 2017-12-05.
- ^ "Parlatype 1.6 released". 24 April 2019. http://gkarsay.github.io/parlatype/2019/04/24/v1.6.html. Retrieved 2019-05-12.
- ^ "Lera KDE git repository". 2015. https://cgit.kde.org/scratch/grasch/lera.git/. Retrieved 2017-07-25.
- ^ "andre-luiz-dos-santos/speech-app". GitHub. 2018-07-12.
- ^ "The Nerd Show – Platypus". thenerdshow.com.
- ^ "FreeSpeech Realtime Speech Recognition and Dictation". TheNerdShow.com.
- ^ "Vedics".
- ^ "rcorcs/NatI". GitHub. 2018-09-24.
- ^ "worden341/sphinxkeys". GitHub. 2016-07-11.
- ^ Simon (KDE); main developer until 2015: Peter Grasch. Retrieved 2017-09-04.
- ^ "Jasper". GitHub.
- ^ Kiecza, Daniel. "Linux". Kiecza.net.
- ^ "Open Mind Speech – Free Speech Recognition for Linux". freespeech.sourceforge.net.
- ^ "Open Mind Initiative". Archived from the original on 2003-08-05. Retrieved 2019-03-16.
- ^ "Perlbox.org Linux Speech Control and Voice Recognition". perlbox.sourceforge.net.
- ^ "Xvoice". xvoice.sourceforge.net.
- ^ Roedder, Margit (26 January 2018). "KIT – Janus Recognition Toolkit". isl.ira.uka.de.
- ^ "WineHQ – Dragon Naturally Speaking". appdb.winehq.org.
External links
- Ergonomics
- GNOME Accessibility
- Linux audio video-related software
- Speech recognition