Construct a prompt to summarize a set of papers from a subject

Usage

build_prompt_subject(
  subject,
  title,
  summary,
  nsentences = 5L,
  instructions = c("I am giving you information about recent bioRxiv/medRxiv preprints.",
    "I'll give you the subject, preprint titles, and short summary of each paper.",
    "Please provide a general summary new advances in this subject/field in general.",
    "Provide this summary of the field in as many sentences as I instruct.",
    "Do not include any preamble text to the summary",
    "just give me the summary with no preface or intro sentence.")
)

Arguments

subject: The name of the subject.
title: A character vector of the titles of the papers in the subject.
summary: A character vector of the paper summaries, as produced by get_preprints() followed by add_prompt() and then add_summary().
nsentences: The number of sentences to use when summarizing the subject.
instructions: Instructions to include in the prompt. A character vector is collapsed into a single string.

Value

A string containing the prompt.
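
The layout of the returned string can be inferred from the example output in the Examples section: the collapsed instructions, then the subject and sentence count, then one Title/Summary pair per paper. The following is a minimal sketch of that layout in base R, not the package's actual implementation. The function name sketch_prompt_subject(), the placeholder default for instructions, and the length check on title and summary are all assumptions made for illustration.

```r
# Hypothetical re-implementation for illustration only; the name
# sketch_prompt_subject() is not part of the package.
sketch_prompt_subject <- function(subject, title, summary, nsentences = 5L,
                                  instructions = "Summarize the field.") {
  # Assumption: each title is paired with exactly one summary.
  stopifnot(length(title) == length(summary))

  # Collapse a multi-element instruction vector into one string.
  instructions <- paste(instructions, collapse = " ")

  # One "Title: ...\nSummary: ..." entry per paper, separated by blank lines.
  papers <- paste0("Title: ", title, "\nSummary: ", summary, collapse = "\n\n")

  paste0(
    instructions,
    "\n\nSubject: ", subject,
    "\nNumber of sentences in summary: ", nsentences,
    "\n\nHere are the titles and summaries:\n\n",
    papers
  )
}

# Toy inputs, not real preprints.
prompt <- sketch_prompt_subject(
  subject = "bioinformatics",
  title = c("Paper A", "Paper B"),
  summary = c("Summary of A.", "Summary of B."),
  nsentences = 2L
)
```

Keeping the instructions as a character vector until the last step makes it easy to add or drop individual sentences before they are joined with a space.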

Examples

title <- example_preprints |> dplyr::filter(subject=="bioinformatics") |> dplyr::pull(title)
summary <- example_preprints |> dplyr::filter(subject=="bioinformatics") |> dplyr::pull(summary)
build_prompt_subject(subject="bioinformatics", title=title, summary=summary)
#> [1] "I am giving you information about recent bioRxiv/medRxiv preprints. I'll give you the subject, preprint titles, and short summary of each paper. Please provide a general summary new advances in this subject/field in general. Provide this summary of the field in as many sentences as I instruct. Do not include any preamble text to the summary just give me the summary with no preface or intro sentence.\n\nSubject: bioinformatics\nNumber of sentences in summary: 5\n\nHere are the titles and summaries:\n\nTitle: MedGraphNet: Leveraging Multi-Relational Graph Neural Networks and Text Knowledge for Biomedical Predictions\nSummary: MedGraphNet leverages multi-relational Graph Neural Networks and text knowledge to improve biomedical predictions by initializing nodes using informative embeddings from existing text knowledge, allowing for robust integration of various data types and improved generalizability. The model demonstrates superior performance compared to traditional single-relation approaches in scenarios with isolated or sparsely connected nodes, particularly in identifying disease-gene associations and drug-phenotype relationships, and shows promising results in accurately inferring drug side effects without direct training on such data.\n\nTitle: High-throughput bacterial aggregation analysis in droplets\nSummary: The communal lifestyle of bacteria can contribute significantly to antimicrobial resistance by promoting biofilm formation. A key approach to addressing this issue is to develop novel techniques for analyzing bacterial behavior, such as those enabled by droplet-based platforms and image analysis methods.\n\nTitle: scParadise: Tunable highly accurate multi-task cell type annotation and surface protein abundance prediction\nSummary: scAdam outperforms existing methods in annotating rare cell types with high accuracy and consistency across diverse datasets. 
scEve enhances clustering and cell type separation through improved surface protein prediction, leading to better characterization of complex tissues.\n\nTitle: Camera Paths, Modeling, and Image Processing Tools for ArtiaX\nSummary: ArtiaX is a plugin that has been extended to improve the analysis and visualization of cryo-electron tomography data through advanced visualization techniques. The plugin allows for the generation of diverse models with putative particle positions and orientations, as well as a coarse grained algorithm to rectify overlaps in template matching, driving camera position and facilitating movie creation with fundamental image filtering options.\n\nTitle: dScaff - an automatic bioinformatics framework for scaffolding draft de novo assemblies based on reference genome data\nSummary: dScaff is an automatic bioinformatics framework designed for scaffolding draft de novo assemblies based on reference genome data. The tool uses a series of bash and R scripts to create a minimal complete scaffold from a genome assembly, with potential future features to be implemented, including using reference chromosomes or scaffolds.\n\nTitle: Jaeger: an accurate and fast deep-learning tool to detect bacteriophage sequences\nSummary: Jaeger's accuracy and speed in identifying bacteriophage sequences outperform existing deep-learning tools by consistently producing few false positives despite encountering diverse viral sequences. The novel method achieves an estimated 2-27% false discovery rate when applied to over 16,000 metagenomic assemblies, which is significantly lower than the benchmarking paper where deep-learning tools produced many false positives.\n\nTitle: AI-Augmented R-Group Exploration in Medicinal Chemistry\nSummary: The paper presents a novel approach to enhancing free-wing QSAR models by embedding R-groups with atom-centric pharmacophoric features, allowing for the distinction of regioisomers and improved predictivity across 12 public datasets. 
The proposed method is integrated into an open-source program, enabling its application in various scenarios, including classic free-Wilson analysis and exploration of uncharted chemical space facilitated by AI-generated building blocks.\n\nTitle: OPLS-based Multiclass Classification and Data-Driven Inter-Class Relationship Discovery\nSummary: OPLS-DA models are widely used in metabolomics for two-class comparisons due to their strong discrimination capabilities, but these models face challenges in multiclass settings. An extension of OPLS-DA called OPLS-HDA integrates Hierarchical Cluster Analysis with the OPLS-DA framework to create a decision tree that addresses multiclass classification challenges and provides intuitive visualization of inter-class relationships.\n\nTitle: STANCE: a unified statistical model to detect cell-type-specific spatially variable genes in spatial transcriptomics\nSummary: STANCE, a unified statistical model to detect cell-type-specific spatially variable genes in spatial transcriptomics, was developed to address the challenges posed by existing methods in detecting spatially variable genes (SVGs) and cell type-specific spatially variable genes (ctSVGs). The proposed method integrates gene expression, spatial location, and cell type composition through a linear mixed-effect model to identify both SVGs and ctSVGs in an initial stage, followed by a second stage test dedicated to ctSVG detection.\n\nTitle: AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow\nSummary: AsaruSim simulates synthetic single-cell long-read Nanopore datasets that closely mimic real experimental data by employing a multi-step process. 
It includes the creation of a synthetic UMI count matrix, generation of perfect reads, optional PCR amplification, introduction of sequencing errors, and comprehensive quality control reporting.\n\nTitle: Building a literature knowledge base towards transparent biomedical AI\nSummary: LiteralGraph extracts biomedical terms and relationships from PubMed literature, establishing a comprehensive knowledge graph. The resulting Genomic Literature Knowledge Base consolidates over 263 million biomedical terms, 14 million relationships, and 10 million genomic events across multiple sources, including nine established repositories.\n\nTitle: Accurate non-invasive quantification of astaxanthin content using hyperspectral images and machine learning\nSummary: The authors investigated a method to accurately quantify astaxanthin content in Haematococcus pluvialis microalgae cultures using hyperspectral images and machine learning. They found that this approach, combining reflectance hyperspectral imaging with a 1-dimensional convolutional neural network, had low average prediction error across a range of astaxanthin contents, although it was unreliable at very low levels (<0.6 micrograms mg-1).\n\nTitle: AlphaMut: a deep reinforcement learning model to suggest helix-disrupting mutations\nSummary: The authors propose a deep reinforcement learning model called AlphaMut to predict helix-disrupting mutations in proteins. AlphaMut identifies amino acids crucial for maintaining structural integrity and predicts key mutations that could alter protein function.\n\nTitle: Beyond Static Brain Atlases: AI-Powered Open Databasing and Dynamic Mining of Brain-Wide Neuron Morphometry\nSummary: NeuroXiv is a large-scale database that provides detailed 3D morphologies of individual neurons mapped to a standard brain atlas, allowing for dynamic, interactive neuroscience applications. 
The database offers a comprehensive collection of 175,149 atlas-oriented reconstructed morphologies of individual neurons from over 518 mouse brains, classified into 292 distinct types and mapped into the Common Coordinate Framework Version 3 (CCFv3).\n\nTitle: Metabolic modeling identifies determinants of thermal growth responses in Arabidopsis thaliana\nSummary: The paper developed an enzyme-constrained model of Arabidopsis thaliana's metabolism, which facilitates predictions of growth-related phenotypes at different temperatures and identifies genes affecting plant growth at suboptimal temperatures. This model was validated using mutant lines, demonstrating its potential in accurately predicting plant thermal responses and providing a template for developing climate-resilient crops.\n\nTitle: Decoding Protein Dynamics: ProFlex as a Linguistic Bridge in Normal Mode Analysis\nSummary: Artificial intelligence has revolutionized structural bioinformatics with AlphaFold being arguably the most impactful development to date. The structural atlases generated by these methods present significant opportunities for unraveling biological mysteries, but also pose challenges in leveraging such massive datasets effectively.\n\nTitle: Exploring midgut expression dynamics: longitudinal transcriptomic analysis of adult female Amblyomma americanum midgut and comparative insights with other hard tick species\nSummary: The study investigates the transcriptomic dynamics of the midgut in adult female Amblyomma americanum ticks during different feeding stages, revealing 15,599 putative DNA coding sequences and highlighting dynamic transcriptional changes as feeding progresses. 
The analysis also identified conserved transcripts across three hard tick species, providing insight into the physiological pathways relevant to the tick midgut and potential avenues for developing control methods targeting multiple tick species.\n\nTitle: Designing of thermostable proteins with a desired melting temperature\nSummary: We developed a regression method for predicting protein melting temperatures (Tm) using 17,312 non-redundant proteins and achieved the highest Pearson correlation of 0.80 with an R2 of 0.63 between predicted and actual Tm values. Our best model, fine-tuned on large language models such as ProtBert, achieved a maximum correlation of 0.89 with an R2 of 0.80, demonstrating improved performance in predicting protein stability at higher temperatures.\n\nTitle: Joint Modeling of Cellular Heterogeneity and Condition Effects with scPCA in Single-Cell RNA-Seq\nSummary: scRNA-seq in multi-condition experiments enables the systematic assessment of treatment effects by analyzing gene expression profiles.   scPCA is a flexible DR framework that jointly models cellular heterogeneity and conditioning variables, allowing for an integrated factor representation and revealing transcriptional changes across conditions and components.\n\nTitle: Identification of potential inhibitors against Inosine 5'-Monophosphate Dehydrogenase of Cryptosporidium parvum through an integrated in silico approach\nSummary: A total of 24 bioactive phytochemicals were screened virtually using molecular docking and ADMET analyses to identify potential inhibitors against Inosine 5'-Monophosphate Dehydrogenase (IMPDH) of Cryptosporidium parvum, with four lead compounds identified as Brevelin A, Vernodalin, Luteolin, and Pectolinarigenin. 
The lead compounds were found to possess favorable pharmacokinetic and pharmacodynamic properties, satisfactory toxicity analysis results, and no major side effects or violation of Lipinski's rules of five, indicating the possibility of oral bioavailability as potential drug candidates.\n\nTitle: Identification and Diagnostic Potential of Pyroptosis-Related Genes in Endometriosis: A Novel Bioinformatics Analysis\nSummary: Pyroptosis-related genes were identified through a bioinformatics analysis of endometriosis (EM) transcriptomic datasets, resulting in 26 differentially expressed genes that play a crucial role in the pathogenesis of EM.  A novel diagnostic model was constructed using LASSO regression based on pyroptosis scores, which included five key genes: KIF13B, BAG6, MYO5A, HEATR, and AK055981.\n\nTitle: Improving the accuracy of pose prediction by incorporating symmetry-related molecules\nSummary: The study aimed to improve the accuracy of pose prediction in molecular docking by incorporating symmetry-related molecules (SRMs). Redocking protein-ligand complexes with and without SRMs revealed that using SRMs significantly improved the prediction of biologically significant poses, as indicated by MM-GBSA calculations.\n\nTitle: Identification and study of Prolyl Oligopeptidases and related sequences in bacterial lineages\nSummary: The study examined ~32000 completely annotated bacterial genomes from the NCBI RefSeq Assembly database to identify annotated S9 family proteins, resulting in the discovery of ~53,000 bacterial S9 family proteins (referred to as POP homologues) which can be classified into distinct subfamilies through various machine-learning approaches and comprehensive analysis. 
These sequence homologues display distinct subclusters and class-specific motifs suggesting differences in substrate specificity in POP homologues.\n\nTitle: Learning-Augmented Sketching Offers Improved Performance for Privacy Preserving and Secure GWAS\nSummary: The introduction of trusted execution environments (TEEs) such as Intel SGX technology has enabled secure and privacy-preserving computation on the cloud, but stringent resource limitations pose a challenge for some TEEs. The SkSES method, which identifies significant SNPs in GWAS without disclosing sensitive genotype information, has been improved upon with a learning-augmented approach that achieves up to 40% accuracy gain compared to the original SkSES method.\n\nTitle: Liberality is More Explainable than PCA of Transcriptome for Vertebrate Embryo Development\nSummary: Liberality is a quantitative index of cellular differentiation and dedifferentiation that has been widely used for genome-scale data analysis, particularly in understanding vertebrate embryo development. The study analyzed a time course transcriptome dataset on vertebrate embryo development and found a trend that historically annotated embryo developmental stages matched changes in liberality, indicating the potential of liberality to analyze biological phenomena beyond just embryo development.\n\nTitle: Bacopa monnieri phytochemicals as promising BACE1 inhibitors for Alzheimers Disease Therapy\nSummary: Bacopa monnieri phytochemicals are investigated as potential BACE1 inhibitors for Alzheimer's Disease Therapy, with Bacopaside I showing superior binding affinity and interaction profile compared to established synthetic inhibitors. 
The study highlights the promising role of natural compounds in AD treatment, emphasizing their potential to overcome limitations faced in clinical settings, and advocates for a paradigm shift towards integrating traditional medicinal knowledge into contemporary drug discovery efforts.\n\nTitle: Accurate Multiple Sequence Alignment of Ultramassive Genome Sets\nSummary: The current state of multiple sequence alignment (MSA) is insufficient for handling ultramassive genome sets due to challenges in scalability and accuracy. The proposed algorithms, including directed acyclic graph construction, profile hidden Markov model training, and graph-based alignment, significantly improve accuracy and acceleration of MSA compared to widely used MAFFT for genome set sizes ranging from 40,000 to over 4 million.\n\nTitle: Machine Learning Driven Simulations of SARS-CoV-2 Fitness Landscape\nSummary: The SARS-CoV-2 infection is caused by interactions between the receptor binding domain of viral spike proteins and host cell ACE2 receptors, with mutations in the spike protein leading to neutralizing antibody escape and breakthrough infections. Machine learning-driven simulations combined with deep mutational scanning data predict variants of concern not seen in the training data and sample statistics of the fitness landscape, providing insight into the relationship between RBD sequence elements and emerging viral strains.\n\nTitle: Modelling dynamics of human NDPK hexamer structure, stability and interactions\nSummary: The precise assembly of the NDPK hexameric structure into homo- /hetero-oligomeric complexes is necessary for kinase activity but has been poorly understood due to high subunit homology, experimental challenges, and limited data on in vivo heterohexamer formation and subunit abundances across cellular compartments. 
A conserved Arg27 residue plays a key role in hexamer assembly, mediating inter- and intra-molecular monomeric interactions and ensuring similar hexameric assembly across subunits.\n\nTitle: GuaCAMOLE: GC-bias aware estimation improves the accuracy of metagenomic species abundances\nSummary: GuaCAMOLE is a novel computational method that detects and removes GC bias from metagenomic sequencing data, which affects the accuracy of quantifying microbial community compositions. The algorithm reports unbiased abundances and corrects the abundance of clinically relevant GC-poor species by up to a factor of two in gut microbiomes of colorectal cancer patients."
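
The printed value above shows newlines as \n escapes because print() quotes the string. A short sketch of how the prompt could be inspected in readable form follows. It uses a small stand-in string with the same layout rather than the real function output, and the variable names are illustrative.

```r
# A short stand-in string using the same layout as the output above.
prompt <- paste0(
  "Subject: bioinformatics\n",
  "Number of sentences in summary: 5\n\n",
  "Here are the titles and summaries:"
)

# cat() and writeLines() render the newlines instead of showing escapes.
writeLines(prompt)

# Split on newlines to inspect the structure programmatically.
prompt_lines <- strsplit(prompt, "\n", fixed = TRUE)[[1]]
n_lines <- length(prompt_lines)
```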