More public data key to democratizing ML, says MLCommons • The Register

April 18, 2022
evan

Table of Contents

  • Increasing understanding with the People’s Speech
  • A corpus to broaden the reach of digital assistants 


Unless you’re an English speaker, and one with as neutral an American accent as possible, you’ve probably butted heads with a digital assistant that couldn’t understand you. With any luck, a couple of open-source datasets from MLCommons could help future systems grok your voice.

The two datasets, which were made generally available in December, are the People’s Speech Dataset (PSD), a 30,000-hour database of spontaneous English speech; and the Multilingual Spoken Words Corpus (MSWC), a dataset of some 340,000 keywords in 50 languages. 

By making both datasets publicly available under CC-BY and CC-BY-SA licenses, MLCommons hopes to democratize machine learning – that is to say, make it available to everyone – and help push the industry toward data-centric AI.

David Kanter, executive director and founder of MLCommons, told Nvidia in a podcast this week that he sees data-centric AI as a conceptual pivot from “which model is the most accurate,” to “what can we do with data to improve model accuracy.” For that, Kanter said, the world needs lots of data.

Increasing understanding with the People’s Speech

Spontaneous speech recognition is still challenging for AIs, and the PSD could help learning machines better understand colloquial speech, speech disorders and accents. Had a database like this existed earlier, said PSD project lead Daniel Galvez, “we’d likely be speaking to our digital assistants in a much less robotic way.” 

The 30,000 hours of speech in the People’s Speech Dataset were culled from a total of 50,000 hours of publicly available speech pulled from the Internet Archive digital library, and the dataset has two unique qualities. First, it’s entirely spontaneous speech, meaning it contains all the tics and imprecisions of the average conversation. Second, it all came with transcripts.

By using some CUDA-powered inference engine tricks, the team behind PSD was able to reduce labeling time of that massive dataset to just two days. The end result is a dataset that can help chatbots and other speech recognition programs better understand people whose voices differ from those of white, male speakers of American English.
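For a rough sense of scale, the figures quoted above can be turned into a back-of-the-envelope calculation. The sketch below is illustrative only and is not the PSD team's tooling: the function names are made up for this example, and it assumes the "two days" is wall-clock time, which the article does not spell out. Because it is also unclear whether the two days covered the 30,000 retained hours or the full 50,000 hours of source audio, it prints both.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Assumption: "two days" means 48 hours of wall-clock time; the article
# does not say how many GPUs were used or which audio total it covers.

SOURCE_HOURS = 50_000   # publicly available speech pulled from the Internet Archive
KEPT_HOURS = 30_000     # hours that ended up in the People's Speech Dataset
LABELING_DAYS = 2       # labeling time quoted in the article


def retained_fraction(kept_hours: float, source_hours: float) -> float:
    """Share of the source audio that made it into the final dataset."""
    return kept_hours / source_hours


def realtime_factor(audio_hours: float, wall_clock_days: float) -> float:
    """Hours of audio processed per hour of wall-clock time."""
    return audio_hours / (wall_clock_days * 24)


if __name__ == "__main__":
    print(f"Retained: {retained_fraction(KEPT_HOURS, SOURCE_HOURS):.0%}")
    print(f"Speed-up over real time (30,000 h): "
          f"{realtime_factor(KEPT_HOURS, LABELING_DAYS):.0f}x")
    print(f"Speed-up over real time (50,000 h): "
          f"{realtime_factor(SOURCE_HOURS, LABELING_DAYS):.0f}x")
```

Under either reading, the pipeline would have had to chew through audio several hundred times faster than real time, which is why the labeling step leaned on GPU-accelerated inference.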

Galvez said that speech disorders, neurological issues and accents are all poorly represented in datasets, and as a result, “[those types of speech] aren’t well understood by commercial products.”

Again, said Kanter, projects like those fail because of a lack of data that includes diverse speakers. 

A corpus to broaden the reach of digital assistants 

The Multilingual Spoken Words Corpus is a different animal from the PSD. Instead of complete sentences, the Corpus consists of 340,000 keywords in 50 languages. “To our knowledge this is the only open-source spoken word dataset for 46 of these 50 languages,” Kanter said. 

Digital assistants, like chatbots, are prone to bias based on their training datasets, which has kept them from catching on as quickly as they could have. Kanter predicts that digital assistants will be available worldwide “by mid-decade,” and he sees the MSWC as a key base for making that happen.

“When you look at equivalent databases, it’s Mandarin, English, Spanish, and then it falls off pretty quick,” Kanter said. 

Kanter said the datasets have already been tested by some of the MLCommons member companies. So far, he said, they’re being used to de-noise audio and video recordings of crowded rooms and conferences, and to improve speech recognition.

In the near future, Kanter said he hopes the datasets will be widely adopted and used alongside other public datasets that commonly serve as sources for ML and AI researchers. ®


