The talk will cover generative modeling for multimodal input (image and text) in the context of product retrieval in fashion/e-commerce.
The presentation will include examples of applying generative adversarial (GAN) architectures to image generation from multimodal queries, using models derived from Conditional GAN, StackGAN, AttnGAN, and others.
Retrieving products from large databases and finding items of particular interest to the user is a topic of ongoing research. Moving beyond text search, tag-based search, and image search, there is still a lot of ambiguity when visual and textual features need to be merged. A text query might complement an image ("I want sport shoes like these in the image, produced by XXX, wide fit and comfortable") or might describe a modification of the image query ("I want a dress like that in the picture, only with shorter sleeves").
- Use cases in e-commerce and fashion
- Current methods for learning multimodal embeddings (VSE, multimodal Siamese networks)
- Intro to GAN architectures that take a latent representation as input (we can influence what we generate, yeah!)
- How to feed multimodal input into a GAN
- Results and comparison
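To make the embedding bullet concrete: a minimal, hypothetical sketch of the VSE-style idea, where image and text features are projected into a shared space and trained with a bidirectional hinge ranking loss. The dimensions, the random projection matrices `W_img`/`W_txt`, and the function names are illustrative assumptions, not the talk's actual models (in practice the projections are learned networks).

```python
import numpy as np

# Hypothetical toy setup: project image and text features into a shared
# embedding space and score pairs with cosine similarity (VSE-style).
rng = np.random.default_rng(0)

IMG_DIM, TXT_DIM, EMB_DIM = 512, 300, 128
W_img = rng.standard_normal((IMG_DIM, EMB_DIM)) * 0.01  # stand-in for a learned image projection
W_txt = rng.standard_normal((TXT_DIM, EMB_DIM)) * 0.01  # stand-in for a learned text projection

def embed(x, W):
    """Project features and L2-normalize, so a dot product is cosine similarity."""
    e = x @ W
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

def hinge_rank_loss(img_feats, txt_feats, margin=0.2):
    """Bidirectional hinge ranking loss over a batch.

    Matching (image, text) pairs sit on the diagonal of the similarity
    matrix; every off-diagonal entry is treated as a negative.
    """
    im = embed(img_feats, W_img)            # (B, EMB_DIM)
    tx = embed(txt_feats, W_txt)            # (B, EMB_DIM)
    sim = im @ tx.T                         # (B, B) cosine similarities
    pos = np.diag(sim)                      # matching-pair scores
    # image-to-text and text-to-image hinge terms, zeroed on the diagonal
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])
    np.fill_diagonal(cost_i2t, 0.0)
    np.fill_diagonal(cost_t2i, 0.0)
    return (cost_i2t.sum() + cost_t2i.sum()) / sim.shape[0]

batch_img = rng.standard_normal((4, IMG_DIM))
batch_txt = rng.standard_normal((4, TXT_DIM))
loss = hinge_rank_loss(batch_img, batch_txt)
print(float(loss))
```

Minimizing this loss pulls matching image/text pairs together and pushes mismatched pairs at least `margin` apart, which is what lets a single index serve both image and text queries.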
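For the "feed multimodal input into a GAN" bullet, the simplest Conditional-GAN-style mechanism is to concatenate the multimodal embedding with the noise vector before the generator's first layer. A hedged sketch, assuming a tiny MLP generator with made-up dimensions (`Z_DIM`, `COND_DIM`, etc.); real systems in the StackGAN/AttnGAN family use convolutional generators and richer conditioning:

```python
import numpy as np

# Hypothetical sketch: a Conditional-GAN-style generator consumes the
# multimodal condition by concatenating it with the noise vector z.
rng = np.random.default_rng(1)

Z_DIM, COND_DIM, HID, OUT = 100, 128, 256, 64 * 64 * 3  # illustrative sizes

W1 = rng.standard_normal((Z_DIM + COND_DIM, HID)) * 0.01  # stand-ins for learned weights
W2 = rng.standard_normal((HID, OUT)) * 0.01

def generator(z, cond):
    """Tiny MLP generator: [z ; cond] -> hidden (ReLU) -> image (tanh)."""
    x = np.concatenate([z, cond], axis=-1)  # the condition enters at the input
    h = np.maximum(0.0, x @ W1)
    return np.tanh(h @ W2)                  # fake image, values in (-1, 1)

z = rng.standard_normal((2, Z_DIM))         # noise: controls sample diversity
cond = rng.standard_normal((2, COND_DIM))   # joint image+text embedding: controls content
imgs = generator(z, cond)
print(imgs.shape)  # (2, 12288) -> two 64x64 RGB images, flattened
```

Because `cond` is the shared image+text embedding rather than a class label, the same wiring lets a query like "a dress like this, only with shorter sleeves" steer what gets generated while `z` still varies the details.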