Use Anh Vu's OpenAI prompting to develop structured metadata about Bioconductor packages, targeting the EDAM ontology and the bio.tools schema.

Usage

edamize(content_for_edam, temp = 0, prescrub = TRUE)

Arguments

content_for_edam

character(1) a URL for a document originating from the developer

temp

numeric(1) temperature setting for the OpenAI chat, see `https://gptcache.readthedocs.io/en/latest/bootcamp/temperature/chat.html`; defaults to 0.0

prescrub

logical(1) if TRUE, apply the cleantxt function to the input before trying to assign EDAM tags; defaults to TRUE

Value

a list with components 'topic' and 'function', which can be converted to a data.frame using `mkdf`

Note

This function is not deterministic. For the provided example, the input is a fixed text, yet the final output can be NULL, a data frame with 12 rows, or a data frame with 14 rows. More work is needed to achieve greater predictability.
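Because a call can return NULL, callers may want to retry a bounded number of times. The following is a minimal sketch of that pattern, not part of biocEDAM: the helper name `retry_non_null` and its `max_tries` argument are hypothetical, and a stand-in function replaces the real API call so the block runs without a key.

```r
# Hypothetical helper (not part of biocEDAM): call `f` up to `max_tries`
# times and return the first non-NULL result, or NULL if every try fails.
retry_non_null <- function(f, max_tries = 3L) {
  for (i in seq_len(max_tries)) {
    res <- f()
    if (!is.null(res)) return(res)
  }
  NULL
}

# Stand-in for a flaky call: NULL on the first call, a placeholder list after.
calls <- 0L
flaky <- function() {
  calls <<- calls + 1L
  if (calls < 2L) NULL else list(placeholder = TRUE)
}

out <- retry_non_null(flaky, max_tries = 3L)

# With the package installed and OPENAI_API_KEY set, the same pattern
# could wrap the real call, for example:
# lk <- retry_non_null(function() edamize(content$focus))
```

A bounded loop keeps the number of API requests predictable. It does not address the other source of variation noted above, namely results that differ in row count between runs.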

Examples

if (interactive()) {
  key = Sys.getenv("OPENAI_API_KEY")
  if (nchar(key)==0) stop("need to have OPENAI_API_KEY set")
  # avoid repetitious reprocessing of tximeta vignette
  # content = vig2data("https://bioconductor.org/packages/release/bioc/vignettes/tximeta/inst/doc/tximeta.html")
  content = readRDS(system.file("rds/tximetaFocused.rds", package="biocEDAM"))
  str(content)
  lk = edamize(content$focus)
  if (is.null(lk)) lk = edamize(content$focus)  # sometimes a second try is needed
  print(mkdf(lk))
  # try content derived from a pdf vignette
  # content2 = vig2data("https://bioconductor.org/packages/release/bioc/vignettes/IRanges/inst/doc/IRangesOverview.pdf")
  content2 = readRDS(system.file("rds/IRangesOVdata.rds", package="biocEDAM"))
  lk2 = edamize(content2$focus)
  mkdf(lk2)
}
#> List of 5
#>  $ author    : chr [1:4] "Michael I. Love" "Charlotte Soneson" "Peter F. Hickey" "Rob Patro"
#>  $ topics    : chr [1:5] "Transcript quantification" "RNA-Seq" "Bioinformatics" "Genomic annotation" ...
#>  $ focused   : chr "\"Tximeta: transcript quantification import with automatic metadata\" presented by Michael I. Love, Charlotte S"| __truncated__
#>  $ coherence : int 87
#>  $ persuasion: num 0.95
#> Warning: strings not representable in native encoding will be translated to UTF-8
#>                                       uri
#> 1      http://edamontology.org/topic_0091
#> 2      http://edamontology.org/topic_3170
#> 3      http://edamontology.org/topic_3308
#> 4      http://edamontology.org/topic_0219
#> 5  http://edamontology.org/operation_3800
#> 6  http://edamontology.org/operation_3223
#> 7  http://edamontology.org/operation_2422
#> 8  http://edamontology.org/operation_0361
#> 9       http://edamontology.org/data_1234
#> 10    http://edamontology.org/format_1929
#> 11    http://edamontology.org/format_1930
#> 12      http://edamontology.org/data_0928
#> 13    http://edamontology.org/format_3752
#> 14    http://edamontology.org/format_3475
#>                                          tm
#> 1                            Bioinformatics
#> 2                                   RNA-Seq
#> 3                           Transcriptomics
#> 4  Data submission, annotation and curation
#> 5                    RNA-Seq quantification
#> 6    Differential gene expression profiling
#> 7                            Data retrieval
#> 8                       Sequence annotation
#> 9               Sequence set (nucleic acid)
#> 10                                    FASTA
#> 11                                    FASTQ
#> 12                  Gene expression profile
#> 13                                      CSV
#> 14                                      TSV
#>                                       uri                         tm
#> 1      http://edamontology.org/topic_0091             Bioinformatics
#> 2      http://edamontology.org/topic_0622                   Genomics
#> 3      http://edamontology.org/topic_0080          Sequence analysis
#> 4  http://edamontology.org/operation_2403          Sequence analysis
#> 5  http://edamontology.org/operation_0292         Sequence alignment
#> 6  http://edamontology.org/operation_0253 Sequence feature detection
#> 7  http://edamontology.org/operation_0291        Sequence clustering
#> 8  http://edamontology.org/operation_2451        Sequence comparison
#> 9       http://edamontology.org/data_0849            Sequence record
#> 10    http://edamontology.org/format_1929                      FASTA
#> 11    http://edamontology.org/format_1930                      FASTQ
#> 12      http://edamontology.org/data_0863         Sequence alignment
#> 13    http://edamontology.org/format_2573                        SAM
#> 14    http://edamontology.org/format_2572                        BAM
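
The `uri` column in the output above mixes several EDAM branches (topic, operation, data, format). As a hedged sketch of one way to separate them, the block below rebuilds a few rows by hand and derives a branch label from each URI; the `branch` column and the `by_branch` object are illustrative names, not something `mkdf` produces.

```r
# A few rows copied from the tximeta output above.
tags <- data.frame(
  uri = c("http://edamontology.org/topic_0091",
          "http://edamontology.org/operation_3800",
          "http://edamontology.org/data_1234",
          "http://edamontology.org/format_1929"),
  tm = c("Bioinformatics", "RNA-Seq quantification",
         "Sequence set (nucleic acid)", "FASTA"),
  stringsAsFactors = FALSE
)

# Derive the branch from the last path element, e.g. "topic_0091" -> "topic".
tags$branch <- sub("_[0-9]+$", "", basename(tags$uri))

# Group the term labels by branch.
by_branch <- split(tags$tm, tags$branch)
print(by_branch)
```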