MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper: arXiv:2404.14239
Multi-modal Concept Extraction
Note: The QFormer encoder E takes three types of inputs: the visual embeddings ξ of an image, a text description l, and learnable query tokens W = [w1, ..., wK], where K is the number of query tokens. The QFormer outputs tokens O = [o1, ..., oK] with the same dimensions as the input query tokens.
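Below is a minimal PyTorch sketch of a QFormer-style interface matching the note above: learnable query tokens W attend to image and text features and come back as K output tokens O of the same dimension. The class name, layer layout, dimensions, and hyperparameters are illustrative assumptions, not the MultiBooth or BLIP-2 implementation.

```python
# Illustrative sketch only: module names, dimensions, and the single
# cross-attention block are assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class QFormerSketch(nn.Module):
    """Learnable query tokens attend to visual and text embeddings,
    producing K output tokens with the same dimension as the queries."""

    def __init__(self, dim: int = 768, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # W = [w1, ..., wK]: learnable query tokens (K = num_queries)
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_ctx = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        """
        visual_embeds: (B, N_img, dim)  -- image embeddings (ξ)
        text_embeds:   (B, N_txt, dim)  -- embedded text description (l)
        returns O:     (B, K, dim)      -- output tokens, same shape as the queries
        """
        B = visual_embeds.size(0)
        q = self.queries.expand(B, -1, -1)
        # Queries exchange information among themselves...
        q = q + self.self_attn(self.norm_q(q), self.norm_q(q), self.norm_q(q))[0]
        # ...then attend to the multi-modal context (image + text embeddings).
        ctx = self.norm_ctx(torch.cat([visual_embeds, text_embeds], dim=1))
        q = q + self.cross_attn(self.norm_q(q), ctx, ctx)[0]
        return q + self.ffn(q)


# Usage: a batch of 2 images with 257 patch embeddings and a 16-token description.
if __name__ == "__main__":
    model = QFormerSketch(dim=768, num_queries=8)
    o = model(torch.randn(2, 257, 768), torch.randn(2, 16, 768))
    print(o.shape)  # torch.Size([2, 8, 768])
```

The key property the sketch preserves is the one stated in the note: whatever the lengths of the visual and text inputs, the output O has exactly K tokens with the same dimension as the input query tokens.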