Construct a prompt to summarize a set of papers from a subject

Usage

build_prompt_subject(
  subject,
  title,
  summary,
  nsentences = 5L,
  instructions =
    c("I am giving you information about preprints published in bioRxiv recently.",
    "I'll give you the subject, preprint titles, and short summary of each paper.",
    "Please provide a general summary new advances in this subject/field in general.",
    "Provide this summary of the field in as many sentences as I instruct.",
    "Do not include any preamble text to the summary",
    "just give me the summary with no preface or intro sentence.")
)

Arguments

subject

The name of the subject.

title

A character vector of paper titles in the subject.

summary

A character vector of paper summaries, one per title, as produced by running get_preprints(), then add_prompt(), then add_summary().

nsentences

The number of sentences to summarize the subject in.

instructions

Instructions for the prompt. This can be a character vector, which is collapsed into a single string.

Value

A string containing the prompt.
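
As a rough illustration of how the arguments map onto the returned string, the sketch below rebuilds a prompt with the same layout as the example output on this page. It is not the package source: sketch_prompt_subject() is a hypothetical helper, its default instructions are placeholders, and the layout (instructions, subject line, sentence count, then one Title/Summary block per paper) is inferred only from that example output.

```r
# Hypothetical helper for illustration only; not the package's implementation.
sketch_prompt_subject <- function(subject, title, summary, nsentences = 5L,
                                  instructions = c("Summarize recent advances in this subject.",
                                                   "Give only the summary, with no preamble.")) {
  # Each title needs a matching summary.
  stopifnot(length(title) == length(summary))

  # Collapse the instruction vector into one space-separated string.
  header <- paste(instructions, collapse = " ")

  # One "Title: ...\nSummary: ..." block per paper, separated by blank lines.
  papers <- paste0("Title: ", title, "\nSummary: ", summary, collapse = "\n\n")

  # Assemble the pieces in the order seen in the example output.
  paste0(
    header,
    "\n\nSubject: ", subject,
    "\nNumber of sentences in summary: ", nsentences,
    "\n\nHere are the titles and summaries:\n\n",
    papers
  )
}

# Toy inputs (made up for this sketch).
prompt <- sketch_prompt_subject(
  subject = "bioinformatics",
  title   = c("Paper A", "Paper B"),
  summary = c("Summary of A.", "Summary of B.")
)
cat(prompt)
```

The result is a single character string, so it can be passed as-is to whatever function sends the prompt to a model.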

Examples

title <- example_preprints |> dplyr::filter(subject=="bioinformatics") |> dplyr::pull(title)
summary <- example_preprints |> dplyr::filter(subject=="bioinformatics") |> dplyr::pull(summary)
build_prompt_subject(subject="bioinformatics", title=title, summary=summary)
#> [1] "I am giving you information about preprints published in bioRxiv recently. I'll give you the subject, preprint titles, and short summary of each paper. Please provide a general summary new advances in this subject/field in general. Provide this summary of the field in as many sentences as I instruct. Do not include any preamble text to the summary just give me the summary with no preface or intro sentence.\n\nSubject: bioinformatics\nNumber of sentences in summary: 5\n\nHere are the titles and summaries:\n\nTitle: Integrity and miss grouping as support for clusters in agglomerative hierarchical methods: the R-package octopucs\nSummary: The proposed method assesses cluster support throughout hierarchical analyses by compiling a consensus topology and using ecological concepts of reciprocal complementarities to define cluster integrity and contamination. This approach allows for building support for groups even when there is partial membership match after resampling, and was implemented in the R package octopucs, which showed robust detection of changes in group memberships compared to other methods.\n\nTitle: Sainsc: a computational tool for segmentation-free analysis of in-situ capture\nSummary: Sainsc is a computational tool that enables segmentation-free analysis of spatially resolved transcriptomics data, allowing for accurate cell-type assignment at the subcellular level without requiring manual cell border delineation. 
The tool provides efficient processing of high-resolution spatial data and can generate maps of cell types with corresponding confidence scores, making it a valuable resource for biomedical researchers working with complex tissue samples.\n\nTitle: BRACE: A novel Bayesian-based imputation approach for dimension reduction analysis of alternative splicing at single-cell resolution\nSummary: Alternative splicing represents an additional layer of complexity in gene expression profiles, but analyzing it at single-cell resolution is challenging due to missing data. This paper introduces BRACE, a Bayesian-based imputation approach that improves upon existing methods and enables dimension reduction analysis of alternative splicing events at single-cell resolution.\n\nTitle: Topological embedding and directional feature importance in ensemble classifiers for multi-class classification\nSummary: Researchers developed a new metric called class-based direction feature importance (CLIFI) to provide interpretable insights into the decision-making process of ensemble classifiers for multi-class classification problems, specifically in the context of cancer biomarker identification. The CLIFI metric was incorporated into four algorithms and applied to The Cancer Genome Atlas proteomics data, resulting in high F1-scores and allowing for the visualization of model decision-making functions and the identification of heterogeneity in several proteins across different cancer types.\n\nTitle: SeuratExtend: Streamlining Single-Cell RNA-Seq Analysis Through an Integrated and Intuitive Framework\nSummary: SeuratExtend is an R package that integrates essential tools and databases for single-cell RNA sequencing (scRNA-seq) data analysis, streamlining the process through a user-friendly interface. 
The package offers various analyses, including functional enrichment and gene regulatory network reconstruction, and seamlessly integrates multiple databases and popular Python tools.\n\nTitle: An Evolutionary Statistics Toolkit for Simplified Sequence Analysis on Web with Client-Side Processing\nSummary: The \"Evolutionary Statistics Toolkit\" is a web-based platform that integrates multiple evolutionary statistics tools for simplified sequence analysis, including Tajima's D calculator and Shannon's Entropy. The open-source toolkit facilitates streamlined workflows for researchers in evolutionary biology and genomics, and also serves as an educational interactive website for beginners in evolutionary statistics.\n\nTitle: A map of integrated cis-regulatory elements enhances gene regulatory analysis in maize\nSummary: The authors integrated various methods for profiling cis-regulatory elements (CREs) in maize, resulting in maps of integrated CREs that show increased completeness and precision. These maps were used to infer drought-specific gene regulatory networks and identify candidate regulators of maize drought response, as well as to study the potential role of transposable elements in regulating gene expression.\n\nTitle: MOSTPLAS: A Self-correction Multi-label Learning Model for Plasmid Host Range Prediction\nSummary: Plasmid host range prediction tools are essential for understanding how plasmids promote bacterial evolution, but existing learning-based tools struggle due to limited well-annotated training samples. 
The proposed model, MOSTPLAS, addresses this issue with a self-correction multi-label learning approach that uses pseudo label learning and asymmetric loss to facilitate training with incomplete labels.\n\nTitle: Bootstrap Evaluation of Association Matrices (BEAM) for Integrating Multiple Omics Profiles with Multiple Outcomes\nSummary: The authors propose Bootstrap Evaluation of Association Matrices (BEAM), a new statistical method that integrates multiple omics profiles with multiple clinical endpoints to identify significant associations between them. BEAM outperformed other integrated analysis methods in simulations and identified biologically relevant genes in a pediatric leukemia application that were missed by univariate screens and other methods.\n\nTitle: Thermodynamic modeling of Csr/Rsm- RNA interactions capture novel, direct binding interactions across the Pseudomonas aeruginosa transcriptome\nSummary: Researchers developed a thermodynamic model to predict interactions between the post-transcriptional regulator RsmA and mRNAs in Pseudomonas aeruginosa, predicting 1043 direct binding interactions, including 457 novel targets. The predictions were validated through in vitro binding assays and in vivo translational reporters, revealing direct regulation of genes involved in quorum sensing and the Type IV Secretion system, expanding the known pool of RsmA target genes.\n\nTitle: Assessing the ability of ChatGPT to extract natural product bioactivity and biosynthesis data from publications\nSummary: ChatGPT was tested on its ability to extract data from publications on natural product bioactivity and biosynthesis, which is crucial for training models that predict natural product activity from biosynthetic gene clusters. 
The results showed that ChatGPT performed well in identifying papers describing natural product discovery and extracting information about the product's bioactivity, but struggled with extracting accession numbers for the biosynthetic gene cluster or producer's genome.\n\nTitle: Genome-Wide Analysis of TCP Family Genes and Their Constitutive Expression Pattern Analysis in the Melon (Cucumis melo)\nSummary: This study identified and characterized 29 putative TCP genes in melon, classifying them into two classes and analyzing their chromosomal location, gene structure, and expression patterns. The results suggest that some CmTCP genes may have similar functions to their homologs in other plant species, while others may have undergone functional diversification, providing a resource for future investigations into their roles in melon development.\n\nTitle: Single-cell differential expression analysis between conditions within nested settings\nSummary: Researchers compared various methods for differential expression analysis of single-cell transcriptomics data and found that methods designed specifically for single-cell data do not offer performance advantages over conventional pseudobulk methods like DESeq2 when applied to individual datasets. However, permutation-based methods excel in performance for atlas-level analysis, but require significantly longer run times, making DREAM a compromise between quality and runtime.\n\nTitle: CoMPHI: A Novel Composite Machine Learning Approach Utilizing Multiple FeatureRepresentation to Predict Hosts of Bacteriophages\nSummary: Here is a 2-sentence summary of the paper:  This study introduces CoMPHI, a novel composite machine learning approach that combines multiple feature representations to predict hosts of bacteriophages, with potential applications in phage therapy for treating bacterial infections. 
The model achieves high prediction accuracy, with an Area Under the ROC Curve (AUC) of up to 96.7% and accuracy of up to 95.1%, outperforming existing methods due to its inclusion of alignment scores and use of both nucleotide and protein sequences from phages and hosts.\n\nTitle: FourierMIL: Fourier filtering-based multiple instance learning for whole slide image analysis\nSummary: The paper presents FourierMIL, a multiple instance learning framework that uses the discrete Fourier transform to analyze whole-slide images (WSIs) in digital pathology. The method captures both global and local dependencies within WSIs and outperforms existing state-of-the-art methods in tumor classification tasks on gigapixel-resolution WSIs.\n\nTitle: Multiple Protein Structure Alignment at Scale with FoldMason\nSummary: Here is a 2-sentence summary of the paper:  FoldMason is a new method for multiple protein structure alignment that can handle hundreds of thousands of structures at scale with high speed and accuracy. It leverages the structural alphabet from Foldseek to compute confidence scores, provide interactive visualizations, and support large-scale protein structure analysis and phylogenetic studies.\n\nTitle: Deciphering octoploid strawberry evolution with serial LTR similarity matrices for subgenome partition\nSummary: A novel approach was developed to assign polyploid genome assemblies to subgenomes using long terminal repeat retrotransposons (LTR-RTs) and the Serial Similarity Matrix (SSM) method, which is particularly useful for genomes without known diploid ancestors. 
The SSM approach was validated using well-studied allopolyploidy genomes and then applied to the octoploid strawberry genome, revealing three allopolyploidization events in its evolutionary history.\n\nTitle: IDENTIFICATION OF IMMUNE RESPONSE AND RNA NETWORK OF RHEUMATOID ARTHRITIS AND MOLECULAR DOCKING OF CELASTRUS PANICULATUS AS POTENTIAL THERAPEUTIC AGENT\nSummary: This study used bioinformatics analysis to identify immune responses, microRNA-hub genes networks, and potential therapeutic agents for rheumatoid arthritis (RA), a complex autoimmune disease with an unknown pathogenesis. The researchers found several hub genes and miRNAs associated with RA, and identified oleic acid and zeylasterone as potential novel drug candidates against the disease through molecular docking analysis of Celastrus paniculatus phytochemical compounds.\n\nTitle: Imputing abundance of over 2500 surface proteins from single-cell transcriptomes with context-agnostic zero-shot deep ensembles\nSummary: SPIDER is a deep ensemble model that predicts the abundance of over 2500 surface proteins from single-cell transcriptomes with improved generalization across diverse contexts such as tissues or disease states. The model outperforms other state-of-the-art methods and has various applications including cell type annotation, biomarker/target identification, and cell-cell interaction analysis in cancer research.\n\nTitle: Modelling Protein-Glycan Interactions with HADDOCK\nSummary: Glycans play important roles in living organisms by interacting with proteins for information transfer and signalling purposes, making it essential to determine the three-dimensional structures of protein-glycan complexes. 
The molecular docking approach HADDOCK was used to predict protein-glycan complexes with a top 5 success rate of 70% for bound datasets and 40% for unbound datasets using a benchmark of 89 complexes.\n\nTitle: Machine Learning Reveals Key Glycoprotein Mutations and Rapidly Assigns Lassa Virus Lineages\nSummary: Machine learning and phylogenetic analysis of Lassa virus glycoprotein sequences revealed key mutations and genetic differences between Nigerian lineages and those from other West African countries. The study identified specific amino acid positions that are highly variable among the lineages, which may explain structural and phenotypical differences, and developed a machine learning-based tool for rapid lineage classification.\n\nTitle: RESP2: An uncertainty aware multi-target multi-property optimization AI pipeline for antibody discovery\nSummary: The RESP2 pipeline is an AI-powered tool designed to optimize the discovery of therapeutic antibodies against infectious disease pathogens, taking into account multiple targets and properties such as specificity, low immunogenicity, and high affinity. The pipeline uses a suite of methods to estimate uncertainty in predictions and has been successfully applied to discover a highly human antibody with broad binding to variants of the COVID-19 spike protein receptor binding domain.\n\nTitle: Extending the capabilities of deconvolution to provide cell type specific pathway analysis of bulk RNA-seq data for idiopathic pulmonary fibrosis\nSummary: A deconvolution method was applied to bulk RNA-seq data from idiopathic pulmonary fibrosis (IPF) samples to correct for changes in cell type proportions and provide cell-type specific pathway analysis. 
The results showed significant increases in fibroblasts and myofibroblasts, decreases in vascular endothelial capillary cells, and IPF-related changes in extracellular matrix organization and TGF-{beta} regulation, as well as the involvement of interferon signaling in ATII cells.\n\nTitle: A survey of ADP-ribosyltransferase families in the pathogenic Legionella\nSummary: A comprehensive bioinformatic survey of 41 Legionella species identified 63 proteins with significant sequence or structural similarity to known ADP-ribosyltransferases (ARTs), organized into 39 ART-like families, including 26 novel families. The study found that most members of the novel ART families are predicted effectors, presenting promising targets for understanding Legionella pathogenicity and developing therapeutic strategies.\n\nTitle: A replicable and modular benchmark for long-read transcript quantification methods\nSummary: Researchers have developed a replicable benchmark for evaluating long-read transcript quantification methods using synthetic RNA-seq datasets, which can be easily extended to include new tools or data sets. The study reveals discrepancies with previously published results and highlights the importance of high-quality simulated data in assessing the robustness of certain approaches.\n\nTitle: Logan: Planetary-Scale Genome Assembly Surveys Life's Diversity\nSummary: The NCBI Sequence Read Archive contains over 50 petabases of DNA sequencing data across 27 million datasets, but its size makes it impractical to search for specific genetic sequences within a reasonable time frame. 
To address this issue, the authors used cloud computing to perform genome assembly on each dataset and created the Logan assemblage, which is now freely available and enables faster querying of the data, with some queries completing in as little as 11 hours.\n\nTitle: Cell-type specific epigenetic clocks to quantify biological age at cell-type resolution\nSummary: Epigenetic clocks have been developed to estimate biological age, but most are based on heterogeneous bulk tissues and reflect both changes in cell-type composition and individual cell aging. This study created neuron- and hepatocyte-specific DNA methylation clocks that provide improved estimates of chronological age and detect accelerated biological aging in Alzheimer's disease and liver pathology.\n\nTitle: Genomic and transcriptomic analyses of Heteropoda venatoria reveal the expansion of P450 family for starvation resistance in spider\nSummary: The genome of Heteropoda venatoria was sequenced and comparative genomic analysis revealed significant expansions in gene families related to lipid metabolism, including cytochrome P450 and steroid hormone biosynthesis genes. The study found that during starvation, H. venatoria undergoes a series of physiological changes, including the activation of fatty acid metabolism and protein degradation pathways, and the expression of expanded P450 gene families, which help the spider maintain a low-energy metabolic state and endure longer periods of starvation.\n\nTitle: Annotation Vocabulary (Might Be) All You Need\nSummary: The authors introduce the \"Annotation Vocabulary\", a language of protein properties defined by structured ontologies that can be used to train transformer models without reference to amino acid sequences. 
They demonstrate the effectiveness of this approach in various experiments, achieving state-of-the-art results on several common datasets with competitive performance on others, and generating high-quality de novo protein sequences from annotation-only prompts.\n\nTitle: AncFlow: An Ancestral Sequence Reconstruction Approach for Determining Novel Protein Structural\nSummary: Here is the summary in 2 sentences:  AncFlow is an automated software pipeline that integrates phylogenetic analysis, subfamily identification, and ancestral sequence reconstruction (ASR) to generate ancestral protein sequences for structural prediction using state-of-the-art tools like AlphaFold. The pipeline was validated on two well-characterized protein families, providing insights into the evolutionary mechanisms underpinning functional diversification within these families and demonstrating its potential to guide protein engineering efforts."