#interspeech
Hi, what is it like being a Vocaloid?
Vocaloid (ボーカロイド, Bōkaroido) is a singing voice synthesizer software product. Its signal processing part was developed through a joint research project led by Kenmochi Hideki at Pompeu Fabra University in Barcelona, Spain, in 2000 and was not originally intended to be a full commercial project.[1] Backed by the Yamaha Corporation, the project developed the software into the commercial product "Vocaloid", released in 2004.[2][3]
The software enables users to synthesize "singing" by typing in lyrics and melody and also "speech" by typing in the script of the required words. It uses synthesizing technology with specially recorded vocals of voice actors or singers. To create a song, the user must input the melody and lyrics. A piano roll type interface is used to input the melody and the lyrics can be entered on each note. The software can change the stress of the pronunciations, add effects such as vibrato, or change the dynamics and tone of the voice.
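As an illustration of the kind of data a user enters, here is a minimal Python sketch of how a note sequence with per-note lyrics and expression parameters might be represented; the field names and value ranges are hypothetical and do not reflect Vocaloid's actual file format.

```python
from dataclasses import dataclass

@dataclass
class Note:
    """One piano-roll note with its lyric and optional expression settings (illustrative only)."""
    pitch: int            # MIDI note number, e.g. 69 = A4
    start_beat: float     # position on the piano roll, in beats
    length_beats: float
    lyric: str            # lyric syllable sung on this note
    vibrato_depth: float = 0.0   # 0.0-1.0, hypothetical scale
    dynamics: int = 64           # 0-127, loudness of the note

# A short phrase: melody and lyrics entered note by note.
phrase = [
    Note(pitch=60, start_beat=0.0, length_beats=1.0, lyric="sa"),
    Note(pitch=62, start_beat=1.0, length_beats=1.0, lyric="ku", vibrato_depth=0.3),
    Note(pitch=64, start_beat=2.0, length_beats=2.0, lyric="ra", dynamics=90),
]
```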
Various voice banks have been released for use with the Vocaloid synthesizer technology.[4] Each is sold as "a singer in a box" designed to act as a replacement for an actual singer.[1] As such, they are released under a moe anthropomorphism. These avatars are also referred to as Vocaloids, and are often marketed as virtual idols; some have gone on to perform at live concerts as an on-stage projection.[5]
The software was originally only available in English starting with the first Vocaloids Leon, Lola and Miriam by Zero-G, and Japanese with Meiko and Kaito made by Yamaha and sold by Crypton Future Media. Vocaloid 3 has added support for Spanish for the Vocaloids Bruno, Clara and Maika; Chinese for Luo Tianyi, Yuezheng Ling, Xin Hua and Yanhe; and Korean for SeeU.
The software is intended for professional musicians as well as casual computer music users.[6] Japanese musical groups such as Livetune of Toy's Factory and Supercell of Sony Music Entertainment Japan have released their songs featuring Vocaloid as vocals. Japanese record label Exit Tunes of Quake Inc. also have released compilation albums featuring Vocaloids.[7][8]
Technology
The Excitation plus Resonances (EpR) model,[9] a voice model developed before Vocaloid, is a combination of:
Spectral Modeling Synthesis (SMS)
Source-Filter model
The model was developed in 2001 as a source-filter model for voice synthesis,[10] but was only implemented on top of the concatenative synthesis model in the final product[citation needed] as a method of avoiding spectral shape discontinuities at the segment boundaries of concatenation.[11]
(based on Fig. 1 in Bonada et al. 2001)
Vocaloid's singing synthesis technology is generally categorized as concatenative synthesis[12][13] in the frequency domain: it splices and processes vocal fragments extracted from human singing voices, stored as time-frequency representations. The Vocaloid system can produce realistic voices by adding vocal expressions such as vibrato to the score information.[14] Initially, Vocaloid's synthesis technology was called "Frequency-domain Singing Articulation Splicing and Shaping" (周波数ドメイン歌唱アーティキュレーション接続法, Shūhasū-domain Kashō Articulation Setsuzoku-hō) upon the release of Vocaloid in 2004,[15] although this name has not been used since the release of Vocaloid 2 in 2007.[16] "Singing articulation" refers to vocal expressions such as vibrato and to the vocal fragments necessary for singing. The Vocaloid and Vocaloid 2 synthesis engines are designed for singing rather than reading text aloud,[17] though software such as Vocaloid-flex and Voiceroid has been developed for that purpose. They cannot naturally replicate singing expressions such as hoarse voices or shouts.[18]
System architecture
Vocaloid system diagram
(based on Fig. 1 in Kenmochi & Ohshima, Interspeech 2007)
The main parts of the Vocaloid 2 system are the Score Editor (Vocaloid 2 Editor), the Singer Library, and the Synthesis Engine. The Synthesis Engine receives score information from the Score Editor, selects appropriate samples from the Singer Library, and concatenates them to output synthesized voices.[3] There is basically no difference in the Score Editor and the Synthesis Engine provided by Yamaha among different Vocaloid 2 products. If a Vocaloid 2 product is already installed, the user can enable another Vocaloid 2 product by adding its library. The system supports three languages: Japanese, Korean, and English, although other languages may become available in the future.[2] It works standalone (playback and export to WAV) and as a ReWire application or a Virtual Studio Technology instrument (VSTi) accessible from a digital audio workstation (DAW).
Score Editor
Score Editor (example)
Song example: "Sakura Sakura"
The Score Editor is a piano roll style editor used to input notes, lyrics, and some expressions. When lyrics are entered, the editor automatically converts them into Vocaloid phonetic symbols using the built-in pronunciation dictionary.[3] The user can directly edit the phonetic symbols of unregistered words.[13] The Score Editor offers various parameters for adding expression to singing voices, and the user is expected to tune these parameters to best fit the synthesized melody.[12] The editor supports ReWire and can be synchronized with a DAW. Real-time "playback" of songs with predefined lyrics using a MIDI keyboard is also supported.[3]
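A toy sketch of the lyric-to-phonetic-symbol conversion described above; the dictionary entries, symbol set, and fallback behaviour are illustrative assumptions, not the editor's actual built-in dictionary.

```python
# Toy pronunciation dictionary: lyric word -> phonetic symbols (illustrative symbols).
PRONUNCIATION_DICT = {
    "sing": ["s", "I", "N"],
    "set":  ["s", "e", "t"],
}

def lyrics_to_phonetics(lyrics, user_overrides=None):
    """Convert entered lyrics to phonetic symbols, letting the user supply
    symbols directly for words missing from the dictionary."""
    user_overrides = user_overrides or {}
    result = []
    for word in lyrics:
        if word in user_overrides:           # manually edited phonetic symbols
            result.append(user_overrides[word])
        elif word in PRONUNCIATION_DICT:     # automatic conversion
            result.append(PRONUNCIATION_DICT[word])
        else:                                # unregistered word: left for manual entry
            result.append(None)
    return result

print(lyrics_to_phonetics(["sing", "gloaming"],
                          user_overrides={"gloaming": ["g", "l", "oU", "m", "I", "N"]}))
```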
Singer Library
Each Vocaloid licensee develops the Singer Library, a database of vocal fragments sampled from real people. The database must contain all possible combinations of phonemes of the target language, including diphones (a chain of two different phonemes) and sustained vowels, as well as polyphones with more than two phonemes if necessary.[3] For example, the voice corresponding to the word "sing" ([sIN]) can be synthesized by concatenating the sequence of diphones "#-s, s-I, I-N, N-#" (# indicating a voiceless phoneme) with the sustained vowel ī.[17] The Vocaloid system changes the pitch of these fragments so that they fit the melody. To obtain more natural sounds, three or four different pitch ranges need to be stored in the library.[19][20] Japanese requires 500 diphones per pitch, whereas English requires 2,500.[17] Japanese has fewer diphones because it has fewer phonemes and most syllabic sounds are open syllables ending in a vowel; its diphones containing a consonant basically follow three patterns: voiceless-consonant, vowel-consonant, and consonant-vowel. English, on the other hand, has many closed syllables ending in a consonant, and therefore consonant-consonant and consonant-voiceless diphones as well. Thus, more diphones need to be recorded for an English library than for a Japanese one. Owing to this linguistic difference, a Japanese library is not suitable for singing fluently in English.[citation needed]
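The diphone decomposition in the example above can be sketched as follows, assuming the phoneme sequence is already available (for instance from the pronunciation dictionary) and using "#" for the voiceless boundary as in the text.

```python
def to_diphones(phonemes):
    """Turn a phoneme sequence into the diphone sequence that would be looked
    up in the Singer Library, padding with '#' at both ends."""
    padded = ["#"] + list(phonemes) + ["#"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# "sing" = [s, I, N]  ->  ['#-s', 's-I', 'I-N', 'N-#']
print(to_diphones(["s", "I", "N"]))
```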
Synthesis Engine
Vocaloid Synthesis Engine[21]
The Synthesis Engine receives score information contained in dedicated MIDI messages called Vocaloid MIDI sent by the Score Editor, adjusts the pitch and timbre of the selected samples in the frequency domain, and splices them to synthesize singing voices.[3] When Vocaloid runs as a VSTi accessible from a DAW, the bundled VST plug-in bypasses the Score Editor and directly sends these messages to the Synthesis Engine.[13]
Pitch conversion
Since the samples are recorded at different pitches, pitch conversion is required when concatenating the samples.[3] The engine calculates the desired pitch from the notes, attack time, and vibrato parameters, and then selects the necessary samples from the library.[13]
Timing adjustment
In singing voices, the consonant onset of a syllable is uttered before the vowel onset. The starting position of a note ("Note-On") must coincide with the vowel onset, not the start of the syllable. Vocaloid keeps the "synthesized score" in memory to adjust sample timing so that the vowel onset falls exactly on the "Note-On" position.[13] Without this timing adjustment, the vocals would sound delayed.
Sample concatenation
Spectral envelope interpolation between samples
Spectral Peak Processing (SPP) for timbre manipulation (based on Fig. 3 in Bonada & Loscos 2003)
When concatenating the processed samples, discontinuities are reduced by spreading the phase between samples via phase correction and by estimating the spectral shape using a source-filter model called the Excitation plus Resonances (EpR) model.[3]
Timbre manipulation
The engine smooths the timbre around the junction of the samples. The timbre of a sustained vowel is generated by interpolating the spectral envelopes of the surrounding samples. For example, when concatenating the sequence of diphones "s-e, e, e-t" of the English word "set", the spectral envelope of the sustained ē at each frame is generated by interpolating the ē at the end of "s-e" and the ē at the beginning of "e-t".[3]
Transforms
After pitch conversion and timbre manipulation, the engine applies transforms such as the Inverse Fast Fourier Transform (IFFT) to output the synthesized voices.[3]
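To make the timbre-manipulation step concrete, here is a minimal numerical sketch of interpolating between two spectral envelopes across a sustained vowel; it shows plain linear interpolation only, not the EpR/SPP processing used in the actual engine.

```python
import numpy as np

def interpolate_envelopes(env_start, env_end, n_frames):
    """Generate per-frame spectral envelopes for a sustained vowel by linearly
    interpolating between the envelope at the end of the previous diphone
    (env_start) and the one at the start of the next diphone (env_end)."""
    env_start = np.asarray(env_start, dtype=float)
    env_end = np.asarray(env_end, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_frames)   # 0 -> pure start, 1 -> pure end
    return [(1 - a) * env_start + a * env_end for a in alphas]

# e.g. the sustained 'e' between the diphones "s-e" and "e-t" in "set"
frames = interpolate_envelopes(env_start=[1.0, 0.8, 0.2],
                               env_end=[0.9, 0.5, 0.4],
                               n_frames=5)
```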
Software history
See also: List of Vocaloid products
Screenshot of the software interface for Vocaloid
"Freely Tomorrow" by Mitchie M
A song with vocals provided by the Vocaloid character Hatsune Miku.
Vocaloid
Main article: Vocaloid (software)
Yamaha started development of Vocaloid in March 2000[17] and announced it for the first time at the German fair Musikmesse on March 5–9, 2003.[22] It was created under the name "Daisy", in reference to the song "Daisy Bell", but for copyright reasons this name was dropped in favor of "Vocaloid".[23]
Vocaloid 2
Main article: Vocaloid 2
Vocaloid 2 was announced in 2007. Unlike the first engine, Vocaloid 2 based its results on vocal samples, rather than analysis of the human voice.[24] The synthesis engine and the user interface were completely revamped, with Japanese Vocaloids possessing a Japanese interface.[12]
Vocaloid 3
Main article: Vocaloid 3
Vocaloid 3 launched on October 21, 2011, along with several products in Japanese, the first launch of its kind. Several studios updated their Vocaloid 2 products for use with the new engine, with improved voice samples.[25]
Vocaloid 4
Main article: Vocaloid 4
In October 2014, the first product confirmed for the Vocaloid 4 engine was the English vocal Ruby, whose release was delayed so she could be released on the newer engine. In 2015, several V4 versions of Vocaloids were released.[26] The Vocaloid 5 engine was then announced soon afterwards.
Vocaloid 5
Main article: Vocaloid 5
Vocaloid 5 was released on July 12, 2018,[27] with an overhauled user interface and substantial engine improvements. The product is only available as a bundle; the standard version includes four voices and the premium version includes eight.[28] This is the first time since Vocaloid 2 that a Vocaloid engine has been sold with vocals, as they were previously sold separately starting with Vocaloid 3.
Vocaloid 6
Vocaloid 6 was released on October 13, 2022, with support for previous voices from Vocaloid 3 and later, and a new line of Vocaloid voices on their own engine within Vocaloid 6 known as Vocaloid:AI. The product is only sold as a bundle, and the standard version includes the 4 voices included with Vocaloid 5, as well as 4 new voices from the Vocaloid:AI line. Vocaloid 6's AI voicebanks support English and Japanese by default, though Yamaha announced they intended to add support for Chinese. Vocaloid 6 also includes a feature where a user can import audio of themselves singing and have Vocaloid:AI recreate that audio with one of its vocals.[29]
linguistlist-blog · 6 months
Text
Calls: Young Female* Researchers in Speech Workshop 2024
Call for Papers: YFRSW 2024 is the eighth of its kind, after a successful inaugural event YFRSW 2016 in San Francisco/USA, followed by YFRSW 2017 in Stockholm/Sweden, YFRSW 2018 in Hyderabad/India, YFRSW 2019 in Graz/Austria, a virtual YFRSW 2021, YFRSW 2022 in Incheon/South Korea, and YFRSW 2023 in Dublin/Ireland. (No event was held in 2020 due to Covid-19.) Students who are accepted to the workshop will receive a grant to pay for their student Interspeech registration (at the early-bird stud http://dlvr.it/T4r4hQ
trustclassifieds · 2 years
Text
Optical expressions
#OPTICAL EXPRESSIONS SOFTWARE#
Mohammed Bennamoun is Winthrop Professor at the Department of Computer Science and Software Engineering at UWA and a researcher in computer vision, machine/deep learning, robotics, and signal/speech processing. He delivered conference tutorials at major conferences, including IEEE Computer Vision and Pattern Recognition (CVPR 2016), Interspeech 2014, the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) and the European Conference on Computer Vision (ECCV). He was also invited to give a tutorial at an International Summer School on Deep Learning (DeepLearn 2017). He won the Best Supervisor of the Year Award at QUT (1998), received awards for research supervision at UWA (2008 & 2016) and the Vice-Chancellor Award for mentorship (2016), and successfully supervised 30+ PhD students to completion. He was awarded 65+ competitive research grants from the Australian Research Council and numerous other government, UWA and industry research grants. He has published four books (available on Amazon), one edited book, one encyclopedia article, 14 book chapters, more than 160 journal papers, more than 270 conference publications, and 16 invited & keynote publications. His h-index is 62 and his number of citations is 17,000+ (Google Scholar).
Application of computer vision to track changes in human facial expressions during long-duration spaceflight may be a useful way to unobtrusively detect the presence of stress during critical operations. Marker-based facial motion capture software is perfect for animation production and integration into FACS pipelines.
Protein expression in mammalian cells is increasingly becoming the system of choice for studying proteins, as it ensures protein folding and glycosylation patterns like those found physiologically. The drawback of using this system is the high costs associated with maintaining these cells for protein expression.
Optical Expressions, 2425 E Camelback Rd, Phoenix AZ 85016. Get directions, reviews and information for Optical Expressions in Phoenix, AZ. We are an eye care private practice committed to meeting all of your vision needs, serving both at our Creve Coeur and Clayton locations. Erica Wilken is an optometrist treating patients at Optical Expressions, Inc. Wilken values the personal interaction with patients and strives to. Located at 548 Page Blvd, Optical Expressions can be reached at +1 41. We handle everything from children and adult eye exams and treatment of eye diseases, to eye. After a comprehensive evaluation of your eye, Optical Expressions can recommend the contact lenses that are best for you, including lenses from CooperVision and other manufacturers. Our team of eye doctors and experienced staff is here for you. We are a Canadian distributor of high quality optical products with an original and yet contemporary touch.
sciforce · 5 years
Text
Our Expectations from INTERSPEECH 2019
In less than a month, from Sep. 15–19, 2019, Graz, Austria will become home for INTERSPEECH, the world's most prominent conference on spoken language processing. The conference unites the science and technology under one roof and becomes a platform for over 2000 participants who will share their insights, listen to eminent speakers, and attend tutorials, challenges, exhibitions and satellite events.
What are our expectations of it as participants and presenters?
Keynotes
Tanja Schultz, the spokesperson of the University of Bremen area "Minds, Media, Machines", will talk on biosignal processing for human-machine interaction. As human interaction involves a wide range of biosignals from speech, gestures, motion, and brain activities, it is crucial to correctly interpret all of them to ensure truly effective human-machine interaction. We are waiting for Tanja Schultz to describe her work on Silent Speech Interfaces, which rely on articulatory muscle movement to recognize and synthesize silently produced speech, and Brain-Computer Interfaces, which use brain activity to recognize speech and convert electrocortical signals into audible speech. Let's move to the new era of brain-to-text and brain-to-speech technology!
Manfred Kaltenbacher of Vienna University of Technology will discuss the physiology and physics of voice production. This talk has a more medical slant, as it looks at voice production from the point of view of physiology and physics. At the same time, it will discuss current computer simulations for pre-surgical prediction of voice quality, as well as the development of examination and training for voice professionals, an interesting departure from the usual technology-oriented talks.
Mirella Lapata, Professor of natural language processing in the School of Informatics at the University of Edinburgh, will talk about learning natural language interfaces with neural models. Back to technology and AI, the talk will address the structured prediction problem of mapping natural language onto machine-interpretable representations. We definitely think that it will be useful for any NLP specialist to know more about a neural network-based general modeling framework — the most promising approach of recent years.
Tutorials
There are eight of them and we love them all! The tutorials tackle diverse topics, but they all discuss the most interesting recent developments and breakthroughs.
Two tutorials concern Generative adversarial networks, showing once again the power of this approach. The tutorial we are going to is offered by National Taiwan University and Academia Sinica. It is dedicated to speech signal processing, including speech enhancement, voice conversion, speech synthesis and, more specifically, sentence generation. Moreover, we can expect real-life GAN algorithms for text style transformation, machine translation and abstractive summarization without paired data.
The second tutorial by Carnegie Mellon University and Bar-Ilan University shows how GANs can be used for speech and speaker recognition and other systems. The tutorial will discuss whether it is possible to fool systems with carefully crafted inputs and how to identify and avoid attacks of such crafted “adversarial” inputs. Finally, we will discuss recent work on introducing “backdoors” into systems through poisoned training examples, such that the system can be triggered into false behaviors when provided specific types of inputs, but not otherwise.
We are also waiting for the tutorial on another popular technique in speech processing, the end-to-end approach. We are expecting from the tutorial by Mitsubishi, Nagoya University, NTT and Johns Hopkins University some interesting insights into advanced methods for neural end-to-end speech processing, namely, unification, integration, and implementation. The tutorial will explore a new open source toolkit ESPnet (end-to-end speech processing toolkit) used on the unified framework and integrated systems.
Nagoya University specialists will offer a tutorial on statistical voice conversion with direct waveform modelling. The tutorial will give an overview of this approach and introduce freely-available software, “sprocket” as a statistical VC toolkit and “PytorchWaveNetVocoder” as a neural vocoder toolkit. It looks like a good chance to try your hand on voice conversion.
Google AI is preparing a tutorial on another hot topic, neural machine translation. However, the tutorial looks more like an overview of the history, mainstream techniques and recent advancements.
Expanding the keynote speech, specialists from Université Grenoble Alpes and Maastricht University will present biosignal-based speech processing, including silent speech to brain-computer interfaces with real data and code.
Uber AI will present their approach to modelling and deploying dialog systems with open-source tools from scratch.
Finally, Paderborn University and NTT will present their insights into microphone array signal processing and deep learning for speech enhancement with hybrid techniques uniting signal processing and neural networks.
Special events and challenges
Special sessions and challenges constitute a separate part of the conference and focus on relevant ‘special’ topics, ranging from computational paralinguistics, distant speech recognition and zero resource speech processing to processing of child’s speech, emotional speech and code switching. All papers are already submitted, but it will be very interesting to see the finals and discuss the winners’ approaches.
Besides the challenges, there will be many more special events and satellite conferences to meet the needs of all specialists working in the field of speech processing: from a workshop for young female researchers to a special event for high school teachers. Participants will be able to join the first ever INTERSPEECH Hackathon, or choose between nine specialized conferences and satellite workshops.
SLaTE
The special event most important to us is the workshop held by the Special Interest Group (SIG) of the International Speech Communication Association (ISCA) as part of the Speech and Language Technology in Education (SLaTE) events. The event brings together practitioners and researchers working on the use of speech and natural language processing for education. This year's workshop will not only have an extensive general session with 19 papers, but will also feature a special session about the Spoken CALL Shared Task (version 3) with 4 papers, plus 4 demo papers.
Our poster
Our biggest expectation is, of course, our own participation! The article written by our stellar specialists Ievgen Karaulov and Dmytro Tkanov, entitled "Attention model for articulatory features detection", was accepted for a poster session. Our approach is a variation of end-to-end speech processing. The article shows that using binary phonological features in the Listen, Attend and Spell (LAS) architecture can give good results for phone recognition even on a small training set like TIMIT. More specifically, the attention model is used to train manner- and place-of-articulation detectors end-to-end and to explore joint phone recognition and articulatory feature detection in a multitask learning setting.
Our SLaTE paper
Yes, we present not just one paper! Since our solution showed the best result on the text subset of the CALL v3 shared task, we wrote a paper exploring our approach, and now we are going to present it at SLaTE. The paper, called "Embedding-based system for the text part of CALL v3 shared task", by four of our team members (Volodymyr Sokhatskyi, Olga Zvyeryeva, Ievgen Karaulov, and Dmytro Tkanov), focuses on NNLM and BERT text embeddings and their use in a scoring system that measures the grammatical and semantic correctness of students' phrases. Our approach does not rely on the reference grammar file for scoring, showing that the highest results can be achieved without a predefined set of correct answers.
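As a rough illustration of the general idea of embedding-based scoring, the sketch below compares a prompt and a response with off-the-shelf sentence embeddings and cosine similarity; the embedding model, threshold, and acceptance rule are placeholders, not the NNLM/BERT-based pipeline described in the paper.

```python
# Illustrative embedding-based acceptability check (not the system in the paper).
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def accept(prompt: str, response: str, threshold: float = 0.5) -> bool:
    """Embed the prompt and the student's response and accept the response
    if their cosine similarity clears a tuned threshold."""
    emb = model.encode([prompt, response])
    cos = float(np.dot(emb[0], emb[1]) /
                (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]) + 1e-9))
    return cos >= threshold

print(accept("Ask for a room with a view", "I would like a room with a view"))
```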
Even now we already anticipate this INTERSPEECH to be a gala event that will give us more knowledge, ideas and inspiration — and a great adventure for our teammates.
linguistlist-blog · 1 year
Text
FYI: August 2023 Newsletter - LDC
In this newsletter: LDC at Interspeech 2023 LDC releases speech activity detector Fall 2023 LDC Data Scholarship Program New publications: 2019 OpenSAT Public Safety Communications Simulation Samrómur Queries Icelandic Speech 1.0 ________________________________________ LDC at Interspeech 2023 LDC is happy to be back in person as an exhibitor and longtime supporter of Interspeech, taking place this year August 20-24 in Dublin, Ireland. Stop by Stand A2 to say hello and learn about the latest d http://dlvr.it/StlcQ9
taimoorzaheer · 3 years
Text
NVIDIA Shares Speech Synthesis Research at Interspeech | NVIDIA Blog
AI has transformed synthesized speech from the monotone of robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers. But there’s still a gap between AI-synthesized speech and the human speech we hear in daily conversation and in the media. That’s because people speak with complex rhythm, intonation and timbre that’s challenging…
cloudtales · 4 years
Text
Towards Computer-Based Automated Screening of Dementia Through Spontaneous Speech
Dementia, a prevalent disorder of the brain, has negative effects on individuals and society. This paper concerns using the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) Challenge of Interspeech 2020 to classify Alzheimer's dementia. We used VGGish, a deep, pretrained TensorFlow model, as an audio feature extractor, and Scikit-learn classifiers to detect signs of dementia in speech. Three classifiers (LinearSVM,…
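A rough sketch of the kind of pipeline the abstract describes (VGGish embeddings feeding a scikit-learn classifier); the TensorFlow Hub handle and the 16 kHz mono-waveform input convention are assumptions about the public VGGish release, not details taken from the paper.

```python
import numpy as np
import tensorflow_hub as hub
from sklearn.svm import LinearSVC

# Pretrained VGGish audio embedding model (assumed TF Hub handle).
vggish = hub.load("https://tfhub.dev/google/vggish/1")

def embed(waveform_16k: np.ndarray) -> np.ndarray:
    """Return a fixed-size clip embedding: VGGish emits one 128-d vector per
    ~0.96 s frame; average the frames over the clip."""
    frames = vggish(waveform_16k.astype(np.float32))   # shape (n_frames, 128)
    return frames.numpy().mean(axis=0)

def train_classifier(waveforms, labels):
    """waveforms: list of 16 kHz mono recordings; labels: 1 = dementia, 0 = control."""
    X = np.stack([embed(w) for w in waveforms])
    clf = LinearSVC(C=1.0)
    clf.fit(X, labels)
    return clf
```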
machinelistening · 4 years
Photo
“INTERSPEECH has grown into the world's largest technical conference focused on speech processing and application with over 1000 attendees and over 600 papers. The conferences emphasize interdisciplinary approaches addressing all aspects of speech science and technology, ranging from basic theories to advanced applications.” (via About the Conference - INTERSPEECH 2020)
linguistlist-blog · 1 year
Text
Summer Schools, Young Female Researchers in Speech Workshop (+ Interspeech) / Ireland
ICYMI: The Young Female Researchers in Speech Workshop (YFRSW) is a workshop for female-identifying Bachelor's and Master’s students currently working in speech science and technology. It is designed to foster interest in research in our field among women who have not yet decided to pursue a PhD in speech science or technology areas, but who have gained some research experience at their university through individual or group projects. http://dlvr.it/SmvlZK
un-enfant-immature · 5 years
Text
Google details AI work behind Project Euphonia’s more inclusive speech recognition
As part of new efforts towards accessibility, Google announced Project Euphonia at I/O in May: An attempt to make speech recognition capable of understanding people with non-standard speaking voices or impediments. The company has just published a post and its paper explaining some of the AI work enabling the new capability.
The problem is simple to observe: The speaking voices of those with motor impairments, such as those produced by degenerative diseases like amyotrophic lateral sclerosis (ALS), simply are not understood by existing natural language processing systems.
You can see it in action in the following video of Google research scientist Dimitri Kanevsky, who himself has impaired speech, attempting to interact with one of the company’s own products (and eventually doing so with the help of related work Parrotron):
(Embedded YouTube video)
The research team describes it as follows:
ASR [automatic speech recognition] systems are most often trained from ‘typical’ speech, which means that underrepresented groups, such as those with speech impairments or heavy accents, don’t experience the same degree of utility.
…Current state-of-the-art ASR models can yield high word error rates (WER) for speakers with only a moderate speech impairment from ALS, effectively barring access to ASR reliant technologies.
It’s notable that they at least partly blame the training set. That’s one of those implicit biases we find in AI models that can lead to high error rates in other places, like facial recognition or even noticing that a person is present. While failing to include major groups like people with dark skin isn’t a mistake comparable in scale to building a system not inclusive of those with impacted speech, they can both be addressed by more inclusive source data.
For Google’s researchers, that meant collecting dozens of hours of spoken audio from people with ALS. As you might expect, each person is affected differently by their condition, so accommodating the effects of the disease is not the same process as accommodating, say, a merely uncommon accent.
A standard voice-recognition model was used as a baseline, then tweaked in a few experimental ways, training it on the new audio. This alone reduced word error rates drastically, and did so with relatively little change to the original model, meaning there’s less need for heavy computation when adjusting to a new voice.
The researchers found that the model, when it is still confused by a given phoneme (that's an individual speech sound like an e or f), makes two kinds of errors. First, it fails to recognize the phoneme that was intended, and thus fails to recognize the word. And second, the model has to guess at which phoneme the speaker did intend, and might choose the wrong one in cases where two or more words sound roughly similar.
The second error in particular is one that can be handled intelligently. Perhaps you say "I'm going back inside the house," and the system fails to recognize the "b" in back and the "h" in house; it's not equally likely that you intended to say "I'm going tack inside the mouse." The AI system may be able to use what it knows of human language, and of your own voice or the context in which you're speaking, to fill in the gaps intelligently.
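A toy illustration of that idea: when the acoustic model is unsure between candidate words, a language model score can pick the sequence that is more plausible in context. The scores below are made-up stand-ins for a real language model.

```python
import math

# Toy in-context word-pair log-probabilities standing in for a real language model.
CONTEXT_LOGPROB = {
    ("going", "back"): math.log(0.20), ("going", "tack"): math.log(0.0001),
    ("the", "house"):  math.log(0.15), ("the", "mouse"):  math.log(0.01),
}

def rescore(hypotheses):
    """Pick the hypothesis whose word pairs the (toy) LM finds most likely."""
    def score(words):
        return sum(CONTEXT_LOGPROB.get(pair, math.log(1e-6))
                   for pair in zip(words, words[1:]))
    return max(hypotheses, key=score)

hyps = [
    ["i'm", "going", "back", "inside", "the", "house"],
    ["i'm", "going", "tack", "inside", "the", "mouse"],
]
print(" ".join(rescore(hyps)))   # prefers "back ... house"
```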
But that’s left to future research. For now you can read the team’s work so far in the paper “Personalizing ASR for Dysarthric and Accented Speech with Limited Data,” due to be presented at the Interspeech conference in Austria next month.
journalgen · 7 years
Text
Matsuyama Journal of Dirty Onomastic Phonotoxicology, Design, and Interspeech
tastydregs · 7 years
Text
Mind-reading machine can guess the magic number
Scientists have made some hugely promising advances when it comes to brain-computer interfaces in recent times, with machines that can turn thoughts into controls for drones, wheelchairs and music. Now, researchers in Japan say that they've broken new ground with a technology that can recognize Japanese words and also guess the single-digit number on a subject's mind with 90 percent accuracy, simply by monitoring their brainwaves.
The system was developed by researchers at Japan's Toyohashi University of Technology and uses an electroencephalogram (EEG) cap. These types of headsets use electrodes placed on the scalp to monitor electrical signals coming from the brain, and though they could have wide-ranging applications, one area in which they seem particularly promising is improving the lives of handicapped people.
In 2010, for example, we saw a home-based EEG cap system that enabled users with locked-in syndrome (where the brain is active but the body is not) to communicate by selecting characters on a virtual keyboard simply by concentrating on them. In the same year, researchers at the University of Utah showed off an implantable (but not brain-penetrating) version that translated brain signals into a limited number of words.
The Japanese researchers, too, hold hopes of one day allowing people without the ability to speak to communicate. They say that turning EEG signals into words has previously been limited by the amount of data these systems can collect, but their new approach to interpreting the signals, which they say is based on holistic pattern recognition and machine learning, achieves "high performance" with just a small data set.
So much so that the system can recognize uttered numbers between 0 and 9 with 90 percent accuracy. Perhaps more promisingly, the system recognized 18 single-syllable Japanese words with 61 percent accuracy, raising the prospect of a brain-activated typewriter.
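For a sense of what such a pipeline can look like, here is a generic EEG-classification sketch (band-pass filtering, simple band-power features, and an SVM); it is not the Toyohashi team's method, whose details the article does not give.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def bandpass(eeg, low=1.0, high=40.0, fs=256.0):
    """Band-pass filter each EEG channel to the 1-40 Hz range."""
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, eeg, axis=-1)

def features(eeg, fs=256.0):
    """Crude per-channel band-power features from the magnitude spectrum."""
    spec = np.abs(np.fft.rfft(bandpass(eeg, fs=fs), axis=-1))
    freqs = np.fft.rfftfreq(eeg.shape[-1], d=1.0 / fs)
    bands = [(1, 4), (4, 8), (8, 13), (13, 30), (30, 40)]
    return np.concatenate([
        spec[:, (freqs >= lo) & (freqs < hi)].mean(axis=-1) for lo, hi in bands
    ])

def train_digit_classifier(trials, labels, fs=256.0):
    """trials: list of (channels, samples) EEG arrays; labels: spoken digit 0-9."""
    X = np.stack([features(t, fs) for t in trials])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, labels)
    return clf
```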
With further development, the team imagines that its system will not only help handicapped people, but act as a more seamless computer interface for healthy people, too. It says it plans to refine the technology and develop a device that can plug into smartphones within five years. It will present its progress at the Interspeech conference in Sweden this August.
Source: Toyohashi University of Technology
Text
New Progress in Google Speech Recognition: Using Sequence Transduction for Multi-Speaker Speech Recognition and Speaker Diarization
For example, in a conversation between a doctor and a patient, the doctor asks, "Have you been taking your heart medication on time?" and the patient answers, "Yes." This is fundamentally different in meaning from the doctor asking the patient "Yes?" in return.
A traditional speaker diarization (SD) system works in two steps. In the first step, the system detects changes in the spectrogram to determine when the speaker changes within a conversation; in the second step, it identifies the individual speakers across the whole conversation. This basic multi-step approach (related reading: https://ieeexplore.ieee.org/document/1202280/) has been in use for more than 20 years, and in all that time researchers have improved model performance only in the "speaker change detection" part.
In recent years, with the development of a new type of neural network model called the recurrent neural network transducer (RNN-T, https://arxiv.org/abs/1211.3711), we now have a suitable architecture that can overcome the limitations of the speaker diarization system we introduced previously (https://ai.googleblog.com/2018/11/accurate-online-speaker-diarization.html) and improve performance. In Google's recently published paper "Joint Speech Recognition and Speaker Diarization via Sequence Transduction" (paper: https://arxiv.org/abs/1907.05337), they propose an RNN-T-based speaker diarization system and show that it reduces the word-level diarization error rate from 20% to 2% (a 10x improvement in performance); the work will be presented at Interspeech 2019.
Traditional speaker diarization systems
Traditional speaker diarization systems rely on acoustic differences between voices to identify the different speakers in a conversation. Men and women can be distinguished relatively easily in a single step by their pitch, using only a simple acoustic model (e.g., a Gaussian mixture model). To tell apart speakers whose pitch may be similar, however, a diarization system has to use a multi-step approach. First, a change-detection algorithm segments the conversation into homogeneous segments based on detected vocal characteristics, with the hope that each segment contains only one speaker. Next, a deep learning model maps each speaker's voice segments to an embedding vector. Finally, in a clustering stage, these embeddings are grouped into clusters to track the same speaker throughout the conversation.
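To make that multi-step pipeline concrete, here is a generic Python sketch; the change detector, the embedding model, and the agglomerative clustering choice are stand-ins, not the specific components of the systems cited above.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(frames, segment_bounds, embed_fn, n_speakers):
    """Classical multi-step diarization sketch.
    frames:         acoustic features for the whole conversation
    segment_bounds: (start, end) frame indices from a change-detection step,
                    each segment hoped to contain a single speaker
    embed_fn:       model mapping a segment of frames to a speaker embedding
    n_speakers:     assumed known, a key weakness noted in the text
    Returns one speaker label per segment."""
    embeddings = np.stack([embed_fn(frames[s:e]) for s, e in segment_bounds])
    return AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)
```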
In real-world scenarios, the speaker diarization system runs in parallel with an automatic speech recognition (ASR) system, and the outputs of the two systems are combined to assign speaker labels to the recognized words.
A traditional speaker diarization system performs inference in the acoustic domain and then overlays speaker labels on the words generated by a separate ASR system.
This approach has several shortcomings that have hindered progress in the field:
(1) The conversation needs to be cut into segments that each contain speech from only one speaker. Otherwise, the embeddings generated from these segments will not accurately characterize the speakers' acoustic features. In practice, however, the change-detection algorithm used here is not perfect, so the resulting segments may contain speech from several speakers.
(2) The clustering stage requires the number of speakers to be known, and this stage is very sensitive to the accuracy of its input.
(3) The system has to make a difficult trade-off between the size of the segments used to estimate voice characteristics and the desired model accuracy. The longer the segment, the better the quality of the voice characteristics, because the model has more information about the speaker. However, this brings the risk of assigning short interjections to the wrong speaker, which can have serious consequences, for example when processing clinical or financial conversations in which affirmative and negative statements must be tracked accurately.
(4) Traditional speaker diarization systems have no convenient mechanism for exploiting the linguistic cues that are quite salient in many natural conversations. For example, "How often do you take your medication?" is most likely said by a care provider, not the patient, in a clinical conversation. Likewise, "When should we hand in the homework?" is most likely said by a student, not the teacher. Linguistic cues also signal a high probability that the speaker has changed (for example, after a question).
That said, there are some examples of traditional speaker diarization systems that perform well, one of which was introduced in an earlier Google blog post (https://ai.googleblog.com/2018/11/accurate-online-speaker-diarization.html). In that work, the hidden states of a recurrent neural network (RNN) track the speakers, overcoming the shortcomings of the clustering stage. The model proposed here takes a different approach and introduces linguistic cues.
An integrated speech recognition and speaker diarization system
We developed a simple new model that not only seamlessly combines acoustic and linguistic cues, but also merges speaker diarization and speech recognition into a single system. Compared with a system that performs only speech recognition under the same conditions, the integrated model does not significantly degrade speech recognition performance.
We realized that a key point is that the RNN-T architecture is very well suited to integrating acoustic and linguistic cues. The RNN-T model consists of three different networks: (1) a transcription network (or encoder), which maps acoustic frames to a latent representation; (2) a prediction network, which predicts the next target label given the previous target labels; and (3) a joint network, which combines the outputs of the two networks above and generates a probability distribution over the output labels at that time step.
Note that the architecture shown in the figure below contains a feedback loop in which previously recognized words are fed back to the model as input; this enables the RNN-T model to take linguistic cues into account (for example, the end of a question).
Diagram of the integrated speech recognition and speaker diarization system, which jointly infers "who said what, and when."
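A minimal PyTorch sketch of the three RNN-T components described above (transcription/encoder, prediction, and joint networks). The layer types and sizes are arbitrary illustrations; the real system's output vocabulary, speaker-role tags, and training details are not shown.

```python
import torch
import torch.nn as nn

class TinyRNNT(nn.Module):
    """Skeleton of the three RNN-T networks described in the post."""
    def __init__(self, n_feats=80, n_labels=100, hidden=256):
        super().__init__()
        # (1) transcription network / encoder: acoustic frames -> latent states
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True)
        # (2) prediction network: previous output labels -> next-label context
        self.embed = nn.Embedding(n_labels, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # (3) joint network: combine both and emit a label distribution
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, n_labels))

    def forward(self, feats, prev_labels):
        enc, _ = self.encoder(feats)                        # (B, T, H)
        pred, _ = self.predictor(self.embed(prev_labels))   # (B, U, H)
        # broadcast to (B, T, U, 2H) and score every (frame, label-step) pair
        t, u = enc.size(1), pred.size(1)
        joint_in = torch.cat([enc.unsqueeze(2).expand(-1, -1, u, -1),
                              pred.unsqueeze(1).expand(-1, t, -1, -1)], dim=-1)
        return self.joint(joint_in)                         # (B, T, U, n_labels)
```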
Training an RNN-T on accelerators such as graphics processing units (GPUs) or tensor processing units (TPUs) is not easy, because computing the loss function requires running the forward-backward algorithm, which involves all possible alignments between the input and output sequences. This problem was recently solved with a TPU-friendly implementation of the forward-backward algorithm that recasts the computation as a sequence of matrix multiplications. We also took advantage of an efficient implementation of the RNN-T loss in TensorFlow, which allowed model development to iterate quickly and made it possible to train a very deep network.
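For reference, current toolkits ship ready-made RNN-T losses; the snippet below uses the torchaudio implementation as one example (not the TensorFlow implementation mentioned in the post), with the usual convention that the label axis of the joint-network logits has length U + 1 because of the leading blank/start step.

```python
import torch
import torchaudio.functional as F

B, T, U, C = 2, 50, 10, 100                      # batch, frames, target length, labels
logits = torch.randn(B, T, U + 1, C)             # joint-network outputs (see sketch above)
targets = torch.randint(1, C, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = F.rnnt_loss(logits, targets, logit_lengths, target_lengths, blank=0)
```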
This integrated model can be trained just like a speech recognition model. The reference transcripts used for training contain the words spoken by a speaker, each followed by a tag specifying that speaker's role, for example, "When is the homework due?" and "I expect you to hand it in before class tomorrow." Once the model has been trained on audio with the corresponding reference transcripts, the user can feed in a conversation recording and obtain output in a similar form. Our analysis shows that the improvements from the RNN-T system affect all categories of errors, including fast speaker turns, segmentation at word boundaries, incorrect speaker alignment in the presence of overlapping speech, and poor audio quality. Moreover, compared with the traditional system, the RNN-T system showed consistent performance, with a marked reduction in variance when the average error per conversation is used as the metric.
Comparison of error rates between the traditional system and the RNN-T system, as categorized by human annotators.
In addition, the integrated model can predict other labels that are needed to produce more reader-friendly ASR transcripts. For example, we can already use matched training data to improve transcript quality with punctuation and capitalization marks. Compared with our previous models (trained separately and applied as an ASR post-processing step), our output has fewer punctuation and capitalization errors.
This model has now become a standard module in our project on understanding medical conversations (https://ai.googleblog.com/2017/11/understanding-medical-conversations.html) and can also be widely adopted in our non-medical speech services.
Via https://ai.googleblog.com/2019/08/joint-speech-recognition-and-speaker.html
from "New Progress in Google Speech Recognition: Using Sequence Transduction for Multi-Speaker Speech Recognition and Speaker Diarization" via KKNEWS
machinelistening · 4 years
Quote
The INTERSPEECH 2017 Computational Paralinguistics Challenge addresses three different problems for the first time in research competition under well-defined conditions: In the Addressee sub-challenge, it has to be determined whether speech produced by an adult is directed towards another adult or towards a child; in the Cold sub-challenge, speech under cold has to be told apart from ‘healthy’ speech; and in the Snoring sub-challenge, four different types of snoring have to be classified. In this paper, we describe these sub-challenges, their conditions, and the baseline feature extraction and classifiers, which include data-learnt feature representations by end-to-end learning with convolutional and recurrent neural networks, and bag-of-audio-words for the first time in the challenge series.
ISCA Archive
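As a pointer to what the bag-of-audio-words representation mentioned above involves, here is a generic sketch (a k-means codebook over frame-level features, then a per-recording histogram); it is illustrative only and not the challenge's actual baseline implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(frame_features, n_words=64, seed=0):
    """Learn an audio 'vocabulary' by clustering frame-level features
    (e.g. MFCC frames) pooled over the training recordings."""
    stacked = np.vstack(frame_features)          # (total_frames, n_dims)
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(stacked)

def bag_of_audio_words(frames, codebook):
    """Quantize each frame to its nearest codeword and return a normalized
    histogram: one fixed-length vector per recording."""
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```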