Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

2025/11/20 00:00

Abstract and 1 Introduction

  2. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  3. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  4. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  5. Conclusion and References

Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model [8] comprising a total of L = 12 layers. In contrast to typical BERT models that process textual inputs, QFormer takes R = 32 learnable query embeddings as inputs. These embeddings are used to extract visual information from the input visual data during Stage-1 pretraining in BLIP2 [22]. After projection, they serve as visual prompt embeddings for the LLM input.
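The projection step above can be sketched as follows. This is an illustrative snippet, not the BLIP2 implementation; the hidden size (768) and LLM embedding size (4096) are assumed values, and `proj` is a hypothetical name for the projection layer.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: BERT-base hidden size and an assumed LLM embedding size
hidden_dim, llm_dim = 768, 4096

# Hypothetical projection from Q-Former query outputs to LLM prompt embeddings
proj = nn.Linear(hidden_dim, llm_dim)

query_outputs = torch.randn(1, 32, hidden_dim)   # R = 32 query outputs from the Q-Former
visual_prompts = proj(query_outputs)             # (1, 32, llm_dim)

# The projected queries are prepended to the text token embeddings as visual prompts
text_embeds = torch.randn(1, 20, llm_dim)        # stand-in for tokenized prompt embeddings
llm_inputs = torch.cat([visual_prompts, text_embeds], dim=1)
```

The LLM then consumes `llm_inputs` exactly as it would a purely textual embedding sequence, with the first 32 positions carrying the visual information.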

Inside the QFormer, each layer includes a self-attention module composed of a multi-head attention component and a feed-forward module (consisting of Linear, LayerNorm, and a residual connection). The cross-attention module, initialized with random values, is inserted every G layers; there, the learnable query embeddings interact with the visual embeddings. In the main paper, for conciseness, we condensed the multi-head attention and feed-forward modules into self-(cross-)attention modules. Furthermore, we exclusively illustrated the modifications made to the cross-attention module in MIVPG, as the self-attention modules remain unchanged. The final QFormer output is the last layer's query embeddings.
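The per-layer structure described above can be sketched as a minimal PyTorch module. This is a shape-level sketch under stated assumptions, not the actual BLIP2 code: class names (`QFormerLayer`, `QFormer`) are invented, the feed-forward expansion ratio (4x) and the default G = 2 spacing of cross-attention layers are assumptions, and text-side inputs are omitted.

```python
import torch
import torch.nn as nn

class QFormerLayer(nn.Module):
    """One sketch block: self-attention over the queries, optional
    cross-attention to visual embeddings, then a feed-forward module,
    each followed by a residual connection and LayerNorm."""

    def __init__(self, dim=768, heads=12, has_cross_attn=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.has_cross_attn = has_cross_attn
        if has_cross_attn:
            # randomly initialized; queries attend to the visual embeddings
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, q, visual):
        q = self.norm1(q + self.self_attn(q, q, q, need_weights=False)[0])
        if self.has_cross_attn:
            q = self.norm2(q + self.cross_attn(q, visual, visual,
                                               need_weights=False)[0])
        return self.norm3(q + self.ffn(q))

class QFormer(nn.Module):
    """L = 12 layers over R = 32 learnable queries; cross-attention is
    inserted every G layers (G = 2 is an assumed default)."""

    def __init__(self, L=12, R=32, dim=768, G=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, R, dim) * 0.02)
        self.layers = nn.ModuleList(
            QFormerLayer(dim, has_cross_attn=(i % G == 0)) for i in range(L))

    def forward(self, visual):                 # visual: (B, N, dim)
        q = self.queries.expand(visual.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, visual)
        return q                               # (B, R, dim) query outputs
```

Note that the output keeps a fixed length R regardless of how many visual tokens N are fed in, which is what makes the query outputs usable as a fixed-size visual prompt.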

For a more comprehensive understanding, readers are encouraged to refer to [22].


:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.

:::

