<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Protein | Xuhong Wang</title><link>https://wangxuhongcn.github.io/en/tags/protein/</link><atom:link href="https://wangxuhongcn.github.io/en/tags/protein/index.xml" rel="self" type="application/rss+xml"/><description>Protein</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 15 Dec 2025 00:00:00 +0000</lastBuildDate><image><url>https://wangxuhongcn.github.io/media/icon_hu_982c5d63a71b2961.png</url><title>Protein</title><link>https://wangxuhongcn.github.io/en/tags/protein/</link></image><item><title>BioBridge: Letting LLMs Truly Understand Proteins Without Sacrificing General Ability</title><link>https://wangxuhongcn.github.io/en/post/biobridge/</link><pubDate>Mon, 15 Dec 2025 00:00:00 +0000</pubDate><guid>https://wangxuhongcn.github.io/en/post/biobridge/</guid><description>
&lt;details class="print:hidden xl:hidden" open>
&lt;summary>Table of Contents&lt;/summary>
&lt;div class="text-sm">
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#abstract">Abstract&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-existing-models-fail-on-real-biological-tasks">Why Existing Models Fail on Real Biological Tasks&lt;/a>&lt;/li>
&lt;li>&lt;a href="#three-core-bottlenecks">Three Core Bottlenecks&lt;/a>&lt;/li>
&lt;li>&lt;a href="#three-core-innovations-in-biobridge">Three Core Innovations in BioBridge&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#1-domain-incremental-continual-pretraining">1. Domain-Incremental Continual Pretraining&lt;/a>&lt;/li>
&lt;li>&lt;a href="#2-protein-language-semantic-alignment-via-plm-projector">2. Protein-Language Semantic Alignment via PLM-Projector&lt;/a>&lt;/li>
&lt;li>&lt;a href="#3-end-to-end-multitask-fine-tuning">3. End-to-End Multitask Fine-Tuning&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#results-a-general-model-finally-approaches-specialist-protein-models">Results: A General Model Finally Approaches Specialist Protein Models&lt;/a>
&lt;ul>
&lt;li>&lt;a href="#stronger-specialist-performance">Stronger specialist performance&lt;/a>&lt;/li>
&lt;li>&lt;a href="#general-capability-is-largely-preserved">General capability is largely preserved&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="#what-the-ablations-show">What the Ablations Show&lt;/a>&lt;/li>
&lt;li>&lt;a href="#a-professional-transformation-of-general-llms">A Professional Transformation of General LLMs&lt;/a>&lt;/li>
&lt;li>&lt;a href="#related-links">Related Links&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/div>
&lt;/details>
&lt;h2 id="abstract">Abstract&lt;/h2>
&lt;p>BioBridge addresses a long-standing mismatch in scientific AI. General large language models are strong at reasoning and in-context learning, but they do not understand proteins. Protein language models are strong at specialist tasks such as structure-related prediction and function annotation, but they are far weaker at cross-task generalization and complex scientific reasoning.&lt;/p>
&lt;p>&lt;code>BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs&lt;/code> does not try to solve this by naively feeding protein sequences into an LLM. Instead, it lets a specialist protein model first read the protein, then maps that information into a semantic space the LLM can actually reason over. The result is a framework that pushes a general LLM much closer to expert-level protein understanding without giving up its original general ability.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Overall multitask comparison among BioBridge, Qwen2.5, and ESM2"
srcset="https://wangxuhongcn.github.io/en/post/biobridge/featured_hu_e8749f9bac812e68.webp 320w, https://wangxuhongcn.github.io/en/post/biobridge/featured_hu_9466d4b8d371f60f.webp 480w, https://wangxuhongcn.github.io/en/post/biobridge/featured_hu_85eef6599b4f04a1.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/biobridge/featured_hu_e8749f9bac812e68.webp"
width="760"
height="711"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="why-existing-models-fail-on-real-biological-tasks">Why Existing Models Fail on Real Biological Tasks&lt;/h2>
&lt;p>The problem goes beyond benchmark-score chasing. The source write-up points to a deeper issue: many current models look strong on evaluation sets but fall apart on real scientific tasks.&lt;/p>
&lt;ul>
&lt;li>Specialist protein models are powerful on narrow tasks, yet remain poor at cross-task transfer and natural-language explanation.&lt;/li>
&lt;li>General LLMs are strong at reasoning and language, but cannot interpret protein sequences as a structured scientific modality.&lt;/li>
&lt;li>Conventional full-parameter fine-tuning often triggers catastrophic forgetting: domain knowledge improves, but general reasoning and language understanding degrade.&lt;/li>
&lt;/ul>
&lt;p>This is why many models can appear fluent in biological QA but still underperform badly on target identification, solubility analysis, or protein interaction benchmarks.&lt;/p>
&lt;h2 id="three-core-bottlenecks">Three Core Bottlenecks&lt;/h2>
&lt;p>The paper frames the difficulty across three layers:&lt;/p>
&lt;ul>
&lt;li>&lt;code>generalization barrier&lt;/code>: performance on standard benchmarks does not transfer reliably across species, functions, and real downstream scenarios&lt;/li>
&lt;li>&lt;code>modality gap&lt;/code>: protein sequences carry structural and functional semantics that ordinary text tokenizers cannot parse&lt;/li>
&lt;li>&lt;code>capability conflict&lt;/code>: adding specialist knowledge to a general model often damages its original general-purpose competence&lt;/li>
&lt;/ul>
&lt;p>BioBridge matters because it treats these as a unified systems problem instead of patching them one by one.&lt;/p>
&lt;h2 id="three-core-innovations-in-biobridge">Three Core Innovations in BioBridge&lt;/h2>
&lt;p>The architecture can be understood in three layers.&lt;/p>
&lt;h3 id="1-domain-incremental-continual-pretraining">1. Domain-Incremental Continual Pretraining&lt;/h3>
&lt;p>The first issue is that an LLM usually lacks even basic biological grounding. BioBridge therefore uses a domain-incremental continual pretraining strategy over a curated biomedical corpus spanning textbooks, PubMed papers, and Swiss-Prot protein-description pairs, with replay mechanisms to preserve prior reasoning skills.&lt;/p>
&lt;p>The goal is not to overwrite the base model, but to let it absorb protein-related knowledge while retaining its original strengths in math, code, and scientific reasoning.&lt;/p>
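&lt;p>The replay idea can be sketched as a batch sampler that mixes new biomedical documents with a fixed share of general-domain samples. This is a minimal illustration under stated assumptions, not the paper&amp;rsquo;s implementation: the function name &lt;code>mix_with_replay&lt;/code>, the &lt;code>replay_ratio&lt;/code> parameter, and the per-batch mixing scheme are all hypothetical.&lt;/p>

```python
import random


def mix_with_replay(domain_docs, general_docs, replay_ratio, batch_size, seed=0):
    """Yield batches that combine domain documents with replayed general ones.

    Assumption: replay is done per batch, with a fixed share of general-domain
    samples. The source only states that replay mechanisms are used.
    """
    rng = random.Random(seed)
    n_replay = round(batch_size * replay_ratio)
    # Every batch must still contain at least one new domain document.
    if not (batch_size > n_replay >= 0):
        raise ValueError("replay_ratio must leave room for domain documents")
    n_domain = batch_size - n_replay

    domain = list(domain_docs)
    rng.shuffle(domain)
    # Walk through the domain corpus once, dropping an incomplete last batch.
    for start in range(0, len(domain) - n_domain + 1, n_domain):
        batch = domain[start:start + n_domain]
        # Replay: re-expose the model to general-domain data it already knows.
        batch += rng.sample(general_docs, n_replay)
        rng.shuffle(batch)
        yield batch


if __name__ == "__main__":
    biomedical = [f"bio-{i}" for i in range(12)]
    general = [f"gen-{i}" for i in range(20)]
    for batch in mix_with_replay(biomedical, general, replay_ratio=0.25, batch_size=4):
        print(batch)
```

&lt;p>In this sketch a higher &lt;code>replay_ratio&lt;/code> trades slower domain adaptation for stronger protection of the base model&amp;rsquo;s existing skills. The value used in the demo is arbitrary.&lt;/p>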
&lt;h3 id="2-protein-language-semantic-alignment-via-plm-projector">2. Protein-Language Semantic Alignment via PLM-Projector&lt;/h3>
&lt;p>The second issue is that protein models and language models do not speak the same language.&lt;/p>
&lt;p>BioBridge uses &lt;code>ESM2&lt;/code> as the protein encoder, then applies a lightweight projector to map protein representations into the LLM&amp;rsquo;s language semantic space. Contrastive learning is used to align protein sequences with biological text descriptions at a deeper semantic level.&lt;/p>
&lt;p>This is a crucial design choice: proteins are not treated as plain strings, but as specialist representations translated into something the LLM can reason about.&lt;/p>
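&lt;p>To make the alignment step concrete, the sketch below projects protein embeddings into a text-sized space and scores protein-text pairs with a symmetric InfoNCE loss. It is an illustration only: the source says a lightweight projector and contrastive learning are used, but the single linear layer, the InfoNCE form, and the temperature of 0.07 are assumptions, and the toy vectors stand in for real &lt;code>ESM2&lt;/code> and text-encoder outputs.&lt;/p>

```python
import math


def project(embedding, weight, bias):
    """Linear projector (assumed form): protein space to text space."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(protein_vecs, text_vecs, temperature=0.07):
    """Symmetric InfoNCE loss where protein i is paired with text i."""
    p = [l2_normalize(v) for v in protein_vecs]
    t = [l2_normalize(v) for v in text_vecs]
    # Cosine similarity of every protein with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t]
              for pi in p]

    def cross_entropy_diag(rows):
        # Mean cross-entropy with the diagonal entry as the correct class.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the protein-to-text and text-to-protein directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    proteins = [project(v, identity, [0.0, 0.0]) for v in ([1.0, 0.0], [0.0, 1.0])]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned loss:", info_nce(proteins, texts))
    print("mismatched loss:", info_nce(proteins, texts[::-1]))
```

&lt;p>Training the projector to minimize this loss pulls each protein toward its own description and away from the other descriptions in the batch, which is the behavior the alignment stage relies on.&lt;/p>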
&lt;h3 id="3-end-to-end-multitask-fine-tuning">3. End-to-End Multitask Fine-Tuning&lt;/h3>
&lt;p>Finally, BioBridge concatenates protein embeddings and text instructions into a unified multimodal input and trains the model end to end in a generative way. A particularly important point from the source write-up is that this enables strong downstream behavior without relying on task-specific labeled datasets, using only protein-text supervision.&lt;/p>
&lt;p>That makes the framework feel less like a benchmark-specific trick and more like a scalable route toward domain-specialized scientific LLMs.&lt;/p>
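&lt;p>The unified input can be pictured as one sequence of vectors: projected protein embeddings first, then the embedded instruction, then the embedded target answer. The sketch below is an assumption-heavy illustration rather than the paper&amp;rsquo;s code: the helper name &lt;code>build_multimodal_input&lt;/code>, the ordering of the segments, and the choice to supervise only the answer positions are all hypothetical.&lt;/p>

```python
def build_multimodal_input(protein_vecs, prompt_vecs, answer_vecs):
    """Concatenate the three segments and mark which positions are supervised.

    Assumption: only answer positions contribute to the generative loss,
    a common convention that the source does not confirm.
    """
    inputs = list(protein_vecs) + list(prompt_vecs) + list(answer_vecs)
    # All vectors must share the language model's hidden size.
    if len({len(v) for v in inputs}) != 1:
        raise ValueError("all embeddings must have the same dimension")
    n_context = len(protein_vecs) + len(prompt_vecs)
    loss_mask = [0] * n_context + [1] * len(answer_vecs)
    return inputs, loss_mask


if __name__ == "__main__":
    protein = [[0.1, 0.2], [0.3, 0.4]]  # projected protein embeddings
    prompt = [[1.0, 0.0]]               # embedded instruction tokens
    answer = [[0.0, 1.0], [0.5, 0.5]]   # embedded target tokens
    inputs, mask = build_multimodal_input(protein, prompt, answer)
    print(len(inputs), mask)
```

&lt;p>Because the protein segment already lives in the model&amp;rsquo;s embedding space, the language model can attend over it like any other part of the prompt.&lt;/p>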
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="The three-stage BioBridge training framework"
srcset="https://wangxuhongcn.github.io/en/post/biobridge/framework_hu_4e3a5e6211e24bb.webp 320w, https://wangxuhongcn.github.io/en/post/biobridge/framework_hu_b46db7e999218893.webp 480w, https://wangxuhongcn.github.io/en/post/biobridge/framework_hu_916340972ada31dc.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/biobridge/framework_hu_4e3a5e6211e24bb.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="results-a-general-model-finally-approaches-specialist-protein-models">Results: A General Model Finally Approaches Specialist Protein Models&lt;/h2>
&lt;p>The experimental results lead to two main conclusions.&lt;/p>
&lt;h3 id="stronger-specialist-performance">Stronger specialist performance&lt;/h3>
&lt;p>On core protein tasks such as enzyme classification, subcellular localization, and metal-ion binding, BioBridge improves over &lt;code>Qwen2.5-7B-Instruct&lt;/code> by more than &lt;code>7%&lt;/code> on average. On protein-drug binding affinity prediction, it performs close to the specialist protein model &lt;code>ESM2&lt;/code>.&lt;/p>
&lt;p>This is important because it suggests that a general LLM is no longer merely imitating biological language, but starting to make genuinely specialist-quality judgments.&lt;/p>
&lt;h3 id="general-capability-is-largely-preserved">General capability is largely preserved&lt;/h3>
&lt;p>Just as important, BioBridge retains the original model&amp;rsquo;s general-purpose behavior. On benchmarks such as &lt;code>MMLU&lt;/code> and &lt;code>RACE&lt;/code>, it stays close to the base &lt;code>Qwen2.5-7B-Instruct&lt;/code> while clearly outperforming models that are only specialized for protein tasks.&lt;/p>
&lt;p>That is the core achievement of the framework: not specialist ability instead of generality, but as much of both as possible.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Response comparison on a subcellular localization case"
srcset="https://wangxuhongcn.github.io/en/post/biobridge/response-comparison_hu_6289584c18076ef1.webp 320w, https://wangxuhongcn.github.io/en/post/biobridge/response-comparison_hu_d1881adba28572a2.webp 480w, https://wangxuhongcn.github.io/en/post/biobridge/response-comparison_hu_cca3cbd4327980f.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/biobridge/response-comparison_hu_6289584c18076ef1.webp"
width="760"
height="306"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="flex justify-center ">
&lt;div class="w-full" >
&lt;img alt="Benchmark comparison across protein models"
srcset="https://wangxuhongcn.github.io/en/post/biobridge/benchmark-table_hu_5e94a0f47442ef9e.webp 320w, https://wangxuhongcn.github.io/en/post/biobridge/benchmark-table_hu_2d95cf8296a09f91.webp 480w, https://wangxuhongcn.github.io/en/post/biobridge/benchmark-table_hu_a05feaad20bed1dd.webp 760w"
sizes="(max-width: 480px) 100vw, (max-width: 768px) 90vw, (max-width: 1024px) 80vw, 760px"
src="https://wangxuhongcn.github.io/en/post/biobridge/benchmark-table_hu_5e94a0f47442ef9e.webp"
width="760"
height="338"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="what-the-ablations-show">What the Ablations Show&lt;/h2>
&lt;p>The source text also highlights two ablation findings:&lt;/p>
&lt;ul>
&lt;li>removing the biological pretraining stage causes a clear drop in downstream biology performance, showing that general LLMs do not naturally acquire deep biological semantics on their own&lt;/li>
&lt;li>removing the &lt;code>ESM2 + Projector&lt;/code> alignment path and feeding raw sequences directly as text into the LLM causes a sharp degradation, confirming that cross-modal alignment is essential rather than incidental&lt;/li>
&lt;/ul>
&lt;p>So the gains are not coming from a single lucky trick. They come from the coordinated design of specialist reading, semantic alignment, and general reasoning.&lt;/p>
&lt;h2 id="a-professional-transformation-of-general-llms">A Professional Transformation of General LLMs&lt;/h2>
&lt;p>The broader significance of BioBridge is not just that it improves protein modeling. It validates a more general route for scientific intelligence:&lt;/p>
&lt;ul>
&lt;li>specialist small models read and encode domain knowledge&lt;/li>
&lt;li>general LLMs handle explanation, reasoning, and transfer&lt;/li>
&lt;li>lightweight alignment modules and continual learning connect the two&lt;/li>
&lt;/ul>
&lt;p>If this pattern holds, there is no reason for it to stop at proteins. It could become a broader recipe for chemistry, materials, medicine, and other scientific domains. In that sense, BioBridge is less a one-off biological system and more an early example of how general LLMs can be specialized at scale by collaborating with domain models.&lt;/p>
&lt;h2 id="related-links">Related Links&lt;/h2>
&lt;ul>
&lt;li>Paper:
&lt;/li>
&lt;/ul></description></item></channel></rss>