
How DALL-E 2 could solve major computer vision challenges

April 21, 2022
evan

Table of Contents

  • Computer vision’s shortcomings
  • Enter DALL-E 2
  • Automating dataset creation using GPT-3 + DALL-E
  • Limitations and mitigations
  • Final words




OpenAI has recently released DALL-E 2, a more advanced version of DALL-E, an ingenious multimodal AI capable of generating images purely from text descriptions. DALL-E 2 employs advanced deep learning techniques that improve the quality and resolution of the generated images and provide further capabilities, such as editing an existing image or creating new versions of it.


Many AI enthusiasts and researchers have tweeted about how amazing DALL-E 2 is at generating art and images from just a few words. In this article, I’d like to explore a different application for this powerful text-to-image model: generating datasets to solve computer vision’s biggest challenges.

Caption: A DALL-E 2 generated image. “A rabbit detective sitting on a park bench and reading a newspaper in a Victorian setting.” Source: Twitter

Computer vision’s shortcomings

Computer vision AI applications can vary from detecting benign tumors in CT scans to enabling self-driving cars. Yet what is common to all is the need for abundant data. One of the most prominent performance predictors of a deep learning algorithm is the size of the underlying dataset it was trained on. For example, the JFT dataset, which is an internal Google dataset used for the training of image classification models, consists of 300 million images and more than 375 million labels.

Consider how an image classification model works: A neural network transforms pixel colors into a set of numbers that represent the image’s features, also known as the “embedding” of the input. Those features are then mapped to the output layer, which contains a probability score for each class of images the model is supposed to detect. During training, the neural network tries to learn the feature representations that best discriminate between the classes, e.g. a pointy ear feature for a Dobermann vs. a Poodle.
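To make the embedding-to-probabilities step concrete, here is a minimal sketch in plain Python. The three-number embedding, the weights, and the function names are invented for illustration; a real model learns thousands of features and weights during training.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(embedding, class_weights, class_names):
    """Map an embedding to one probability per class via a linear output layer."""
    logits = [
        sum(w * f for w, f in zip(weights, embedding))
        for weights in class_weights
    ]
    return dict(zip(class_names, softmax(logits)))

# Hypothetical embedding: [pointy ears, curly coat, body size]
embedding = [0.9, 0.1, 0.7]
# Hypothetical learned weights, one row per class
class_weights = [
    [2.0, -1.0, 0.5],   # Dobermann: rewards pointy ears
    [-1.0, 2.0, 0.0],   # Poodle: rewards a curly coat
]

scores = classify(embedding, class_weights, ["Dobermann", "Poodle"])
best = max(scores, key=scores.get)
print(best, scores)
```

The strong “pointy ears” value pushes most of the probability toward the Dobermann class, which is the kind of discriminating feature described above.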

Ideally, the machine learning model would learn to generalize across different lighting conditions, angles, and background environments. Yet more often than not, deep learning models learn the wrong representations. For example, a neural network might deduce that blue pixels are a feature of the “frisbee” class because all the images of a frisbee it has seen during training were on the beach.

One promising way of solving such shortcomings is to increase the size of the training set, e.g. by adding more pictures of frisbees with different backgrounds. Yet this exercise can prove to be a costly and lengthy endeavor. 

First, you would need to collect all the required samples, e.g. by searching online or by capturing new images. Then, you would need to ensure each class has enough samples, so that the model does not overfit to some classes and underfit others. Lastly, you would need to label each image, stating which image corresponds to which class. In a world where more data translates into a better-performing model, these three steps act as a bottleneck for achieving state-of-the-art performance.
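The balance check in the second step can start as a simple count of labels per class. The sketch below assumes the labels are available as a plain list of class names; the class names, counts, and the `min_per_class` cutoff are made up for illustration.

```python
from collections import Counter

def underrepresented_classes(labels, min_per_class):
    """Return the classes that have fewer samples than the chosen minimum."""
    counts = Counter(labels)
    return {name: n for name, n in counts.items() if n < min_per_class}

# Hypothetical label list for a dog breed dataset
labels = ["poodle"] * 120 + ["dobermann"] * 95 + ["dalmatian"] * 8

short = underrepresented_classes(labels, min_per_class=50)
print(short)
```

Classes flagged this way are the ones that need more collected, or generated, samples.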

But even then, computer vision models are easily fooled, especially if they are being attacked with adversarial examples. Guess what is another way to mitigate adversarial attacks? You guessed right — more labeled, well-curated, and diverse data.

Caption: OpenAI’s CLIP wrongly classified an apple as an iPod due to a textual label. Source: OpenAI

Enter DALL-E 2

Let’s take an example of a dog breed classifier and a class for which it is a bit harder to find images — Dalmatian dogs. Can we use DALL-E to solve our lack-of-data problem?

Consider applying the following techniques, all powered by DALL-E 2:

  • Vanilla use. Feed the class name as part of a textual prompt to DALL-E and add the generated images to that class’s training samples. For example, “A Dalmatian dog in the park chasing a bird.”
  • Different environments and styles. To improve the model’s ability to generalize, use prompts with different environments while maintaining the same class. For example, “A Dalmatian dog on the beach chasing a bird.” The same applies to the style of the generated image, e.g. “A Dalmatian dog in the park chasing a bird in the style of a cartoon.”
  • Adversarial samples. Use the class name to create a dataset of adversarial examples. For instance, “A Dalmatian-like car.”
  • Variations. One of DALL-E’s new features is the ability to generate multiple variations of an input image. It can also take a second image and fuse the two by combining the most prominent aspects of each. One can then write a script that feeds all of the dataset’s existing images to generate dozens of variations per class.
  • Inpainting. DALL-E 2 can also make realistic edits to existing images, adding and removing elements while taking shadows, reflections, and textures into account. This can be a strong data augmentation technique to further train and enhance the underlying model.
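The first two techniques above come down to combining a class name with different environments and styles. Here is a minimal sketch of that prompt grid, assuming a simple sentence template; the function name and the template are illustrative and not part of any DALL-E interface.

```python
from itertools import product

def build_prompts(class_name, actions, environments, styles):
    """Return (prompt, label) pairs covering every combination of inputs."""
    prompts = []
    for action, env, style in product(actions, environments, styles):
        prompt = f"A {class_name} dog {env} {action}"
        if style:  # an empty style means a plain, photo-like prompt
            prompt += f" in the style of {style}"
        prompts.append((prompt + ".", class_name))
    return prompts

prompts = build_prompts(
    "Dalmatian",
    actions=["chasing a bird"],
    environments=["in the park", "on the beach"],
    styles=["", "a cartoon"],
)
for prompt, label in prompts:
    print(label, "|", prompt)
```

Each prompt carries its label along with it, so whatever image comes back is labeled from the start.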

Besides generating more training data, the huge benefit of all of the above techniques is that the newly generated images are already labeled, removing the need for a human labeling workforce.

While image-generating techniques such as generative adversarial networks (GANs) have been around for quite some time, DALL-E 2 stands out for its 1024×1024 high-resolution generations, its multimodal nature of turning text into images, and its strong semantic consistency, i.e. understanding the relationship between different objects in a given image.

Automating dataset creation using GPT-3 + DALL-E

DALL-E’s input is a textual prompt of the image we wish to generate. We can leverage GPT-3, a text generating model, to generate dozens of textual prompts per class that will then be fed into DALL-E, which in turn will create dozens of images that will be stored per class.

For example, we could generate prompts that include different environments for which we would like DALL-E to create images of dogs.

Caption: A GPT-3 generated prompt to be used as input to DALL-E. Source: author

Using this example, and a template-like sentence such as “A [class_name] [gpt3_generated_actions],” we could feed DALL-E with the following prompt: “A Dalmatian lying down on the floor.” This can be further optimized by fine-tuning GPT-3 to produce dataset captions such as the one in the OpenAI Playground example above.
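The loop described in this section can be sketched as follows. Both model calls are passed in as functions, because the exact GPT-3 and DALL-E interfaces are not specified here: `suggest_actions` and `generate_image` are hypothetical stand-ins, and the fake versions below exist only so the sketch runs offline.

```python
from typing import Callable, Dict, List

def build_dataset(
    class_names: List[str],
    suggest_actions: Callable[[str], List[str]],
    generate_image: Callable[[str], bytes],
) -> Dict[str, List[dict]]:
    """Fill the 'A [class_name] [action]' template and store images per class."""
    dataset: Dict[str, List[dict]] = {}
    for class_name in class_names:
        samples = []
        for action in suggest_actions(class_name):
            prompt = f"A {class_name} {action}."
            samples.append({
                "prompt": prompt,
                "image": generate_image(prompt),
                "label": class_name,  # the label comes for free
            })
        dataset[class_name] = samples
    return dataset

# Stand-ins (assumptions): replace with real text and image model calls.
def fake_suggest_actions(class_name):
    return ["lying down on the floor", "running on the beach"]

def fake_generate_image(prompt):
    return prompt.encode("utf-8")  # placeholder bytes, not a real image

dataset = build_dataset(
    ["Dalmatian", "Poodle"], fake_suggest_actions, fake_generate_image
)
print(dataset["Dalmatian"][0]["prompt"])
```

Keeping the model calls behind plain function arguments means the same loop works whichever text and image models end up being used.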

To further increase confidence in the newly added samples, one can set a certainty threshold and keep only the generations that rank above it, since every generated image is ranked by CLIP, a model that scores how well an image matches a text caption.
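A minimal sketch of such a threshold filter is shown below. It assumes each generated sample already carries a `clip_score` field; the field name, the scores, and the 0.25 cutoff are invented for illustration and would need tuning on real data.

```python
def keep_confident(samples, threshold):
    """Keep only the generations whose CLIP score meets the threshold."""
    return [s for s in samples if s["clip_score"] >= threshold]

# Hypothetical generations with made-up scores
samples = [
    {"prompt": "A Dalmatian lying down on the floor.", "clip_score": 0.31},
    {"prompt": "A Dalmatian running on the beach.", "clip_score": 0.18},
    {"prompt": "A Dalmatian chasing a bird.", "clip_score": 0.27},
]

accepted = keep_confident(samples, threshold=0.25)
print([s["prompt"] for s in accepted])
```

A higher threshold trades dataset size for confidence in each sample.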

Limitations and mitigations

If not used carefully, DALL-E can generate inaccurate images or ones of a narrow scope, excluding specific ethnic groups or disregarding traits that might lead to bias. A simple example would be a face detector that was only trained on images of men. Moreover, using images generated by DALL-E might hold a significant risk in specific domains such as pathology or self-driving cars, where the cost of a false negative is extreme.

DALL-E 2 still has some limitations, with compositionality being one of them. Relying on prompts that, for example, assume the correct positioning of objects might be risky.

Caption: DALL-E still struggles with some prompts. Source: Twitter

Ways to mitigate this include human sampling, where a human expert randomly selects samples and checks their validity. To optimize such a process, one can follow an active-learning approach in which the images that received the lowest CLIP ranking for a given caption are prioritized for review.
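Here is a minimal sketch of that review process, combining a prioritized queue with a random spot check. It assumes each sample has an `id` and a `clip_score` field; the names, scores, and budgets are illustrative.

```python
import heapq
import random

def review_queue(samples, budget):
    """Return the `budget` samples with the lowest CLIP score, worst first."""
    return heapq.nsmallest(budget, samples, key=lambda s: s["clip_score"])

def spot_check(samples, already_queued, k, seed=0):
    """Randomly pick up to k of the remaining samples for a human check."""
    queued_ids = {s["id"] for s in already_queued}
    rest = [s for s in samples if s["id"] not in queued_ids]
    return random.Random(seed).sample(rest, min(k, len(rest)))

# Hypothetical generations with made-up scores
samples = [
    {"id": 1, "clip_score": 0.31},
    {"id": 2, "clip_score": 0.12},
    {"id": 3, "clip_score": 0.27},
    {"id": 4, "clip_score": 0.19},
    {"id": 5, "clip_score": 0.35},
]

priority = review_queue(samples, budget=2)
extra = spot_check(samples, priority, k=1)
print([s["id"] for s in priority], [s["id"] for s in extra])
```

The reviewer’s time goes first to the generations most likely to be wrong, while the random picks guard against errors that CLIP scores highly.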

Final words

DALL-E 2 is yet another exciting research result from OpenAI that opens the door to new kinds of applications. Generating huge datasets to address one of computer vision’s biggest bottlenecks, data, is just one example.

OpenAI signals it will release DALL-E sometime during this upcoming summer, most likely in a phased release with a pre-screening for interested users. Those who can’t wait, or who are unable to pay for this service, can tinker with open source alternatives such as DALL-E Mini (Interface, Playground repository).

While the business case for many DALL-E-based applications will depend on the pricing and policy OpenAI sets for its API users, they are all certain to take image generation one big leap forward.

Sahar Mor has 13 years of engineering and product management experience focused on AI products. He is currently a Product Manager at Stripe, leading strategic data initiatives. Previously, he founded AirPaper, a document intelligence API powered by GPT-3 and was a founding Product Manager at Zeitgold (Acq. By Deel), a B2B AI accounting software company where he built and scaled its human-in-the-loop product, and Levity.ai, a no-code AutoML platform. He also worked as an engineering manager in early-stage startups and at the elite Israeli intelligence unit, 8200.





