Impact Of Genomics Undergrad
Impact of UNAM’s undergrad in Genomic Sciences on academic research
Alejandro Reyes
2019-07-27
UNAM’s Undergrad program in Genomic Sciences (LCG, for its initials in Spanish) was created with the objective of providing students with the background necessary to do genomics research. The program has very strict selection criteria, involving two exams and one interview. Classes are small (20 to 30 people) compared to other UNAM programs. The first students graduated in August of 2009. I know that most alumni continue their education and go on to obtain master’s and PhD degrees. Some of them enter prestigious PhD programs both in Mexico and abroad. Perhaps the most accomplished alumni, academically speaking, are the 5 to 10 graduates who now lead independent research groups. I think most of these research groups are based at UNAM Juriquilla, Mexico. Others have accomplished a lot in their careers outside of academia. For example, Mariana Matus is the CEO and co-founder of a start-up company in Boston.
In any case, the objective of this document is to assess the impact that LCG has had on academic research by calculating the number of publications from LCG alumni as well as the impact of these publications.
The data
List of LCG alumni
In order to obtain publication and citation statistics for each LCG alumnus, I generated a Google spreadsheet that contains each student who had graduated by August 23rd, 2019. The list of graduates, along with the date of graduation, was obtained from the LCG’s website. I manually filled the columns of the spreadsheet with Google Scholar identifiers and I tried to fill in the class to which each person belongs when I could remember it. I collected data for the first six classes that graduated from LCG. Then I got bored/tired and stopped. If anyone is interested in helping fill in the spreadsheet further, please let me know. For now, the code below reads this compiled spreadsheet into R.
library(googlesheets)
gap <- gs_title( "GoogleGenomicos" ) ## you need to have this spreadsheet in your google drive and be logged in
dat <- gs_read( gap )
head( dat )
## # A tibble: 6 x 4
## Name Graduation Generation GoogleScholarID
## <chr> <date> <dbl> <chr>
## 1 Rocío Domínguez Vidaña 2007-09-14 NA C87y0MMAAAAJ
## 2 Lucía Guadalupe Morales Reyes 2007-09-14 NA k6GlzNUAAAAJ
## 3 Selene Lizbeth Fernández Valverde 2007-09-24 NA iHnkhAgAAAAJ
## 4 Alejandra Eugenia Medina Rivera 2007-10-01 NA 7cF2WKQAAAAJ
## 5 Estefanía García Ruiz 2007-10-04 NA <NA>
## 6 Santiago Sandoval Motta 2007-10-04 NA 1Ud8j-UAAAAJ
Out of the data I compiled, I could find 73 profiles, which correspond to 26.74% of current graduates. These data correspond to roughly 47.71% of the students from the first six classes. Missing profiles are most probably not missing at random, as people with more publications are more likely to have a Google Scholar profile. This could result in inflated estimates and it is something to consider when reaching conclusions.
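The coverage figures are the fraction of rows with a non-missing Google Scholar identifier. As a minimal sketch, the snippet below applies that calculation to only the six identifiers printed by head( dat ) above, so its result describes this small subset and not the percentages quoted for the full spreadsheet.

```r
## Toy subset: the six GoogleScholarID values printed above (one is missing)
ids <- c( "C87y0MMAAAAJ", "k6GlzNUAAAAJ", "iHnkhAgAAAAJ",
          "7cF2WKQAAAAJ", NA, "1Ud8j-UAAAAJ" )

## number of profiles found and percentage of rows they cover
nProfiles <- sum( !is.na(ids) )
coverage <- round( 100 * mean( !is.na(ids) ), 2 )
nProfiles
coverage
```

Running the same two lines on dat$GoogleScholarID would give the overall coverage; restricting dat to the first six classes first would give the second figure.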
Publication numbers and citations
Having collected the Google Scholar identifiers for LCG alumni, I am using the CRAN package scholar to scrape the data for each Google Scholar profile. For each profile, I am obtaining the citation history and the data for each of the publications. In each iteration, I am introducing Sys.sleep commands whose waiting times, in seconds, are sampled from normal distributions, in order to avoid captchas.
library(scholar)
googleIDs <- dat$GoogleScholarID
googleIDs <- googleIDs[!is.na( googleIDs )]
scholarDataFile <- "scholarData.rds"
if( !file.exists(scholarDataFile) ){
  scholarData <- lapply( googleIDs, function(x){
    pubs <- try( get_publications(x) )
    Sys.sleep( abs( rnorm(1, 10, 5) ) ) ## randomize waiting times to avoid captchas
    cits <- try( get_citation_history(x) )
    Sys.sleep( abs( rnorm(1, 60, 20) ) )
    list( pubRecord=pubs,
          citHistory=cits )
  } )
  names( scholarData ) <- googleIDs
  saveRDS( scholarData, file=scholarDataFile )
}
scholarData <- readRDS(scholarDataFile)
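The chunk above follows a "compute once, then read from disk" pattern so that the slow, rate-limited scraping only runs when no cached file exists. Here is a small sketch of that pattern in isolation. cacheRDS and slowStep are hypothetical names made up for this illustration, and slowStep is a stand-in that does not contact Google Scholar.

```r
## Hypothetical helper (not part of scholar or base R):
## run compute() only if `path` does not exist yet, then read the cached copy
cacheRDS <- function( path, compute ){
  if( !file.exists(path) ){
    saveRDS( compute(), file=path )
  }
  readRDS( path )
}

## Stand-in for the slow scraping step; counts how many times it really runs
nCalls <- 0
slowStep <- function(){
  nCalls <<- nCalls + 1
  Sys.sleep( abs( rnorm(1, 0.01, 0.005) ) ) ## short randomized pause
  list( a=1:3 )
}

tmp <- tempfile( fileext=".rds" )
first <- cacheRDS( tmp, slowStep )
second <- cacheRDS( tmp, slowStep ) ## served from disk, slowStep() is not called again
nCalls
```

Deleting the cached file is then the only thing needed to force a fresh download.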
Then, I do some data wrangling to convert them to long-formatted data frames and start exploring these data.
pubRecords <- lapply( scholarData, "[[", "pubRecord" )
citHistory <- lapply( scholarData, "[[", "citHistory" )
stopifnot(all(vapply(pubRecords, class, character(1)) == "data.frame"))
stopifnot(all(vapply(citHistory, class, character(1)) == "data.frame"))
pubRecords <- purrr::map_df( pubRecords, ~as.data.frame(.x), .id="GoogleScholarID")
citHistory <- purrr::map_df( citHistory, ~as.data.frame(.x), .id="GoogleScholarID")
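To make the reshaping step concrete, here is a sketch on made-up data of how a named list of per-profile data frames becomes one long data frame with an identifier column. It uses base R instead of purrr::map_df so that it runs without extra packages. The identifiers idA and idB and their values are invented.

```r
## Toy stand-in for the list returned by the scraping step
toyList <- list(
  idA = data.frame( year=c(2015, 2016), cites=c(3, 10) ),
  idB = data.frame( year=2016, cites=7 )
)

## keep only elements that really are data frames (a failed try() is not)
ok <- vapply( toyList, is.data.frame, logical(1) )

## prepend the list name as an identifier column and stack the pieces
toyLong <- do.call( rbind, Map(
  function( df, id ) data.frame( GoogleScholarID=id, df, stringsAsFactors=FALSE ),
  toyList[ok], names(toyList)[ok] ) )
rownames( toyLong ) <- NULL
toyLong
```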
Publication records
Where do LCG alumni publish?
The code below computes the number of articles published in each journal and sorts the journals by the number of publications co-authored by LCG alumni. I then plot the journals that appear most frequently among the publication profiles of LCG alumni.
library(magrittr)
library(ggplot2)
library(cowplot)
pubRecords <- pubRecords %>%
  dplyr::filter(journal != "")
theme_set(theme_cowplot())
pubPerJournal <- pubRecords %>%
  dplyr::select( journal, cid ) %>%
  unique() %>%
  dplyr::group_by( journal=tolower(journal) ) %>%
  dplyr::summarise( numb=dplyr::n() ) %>%
  dplyr::arrange( desc(numb) ) %>%
  dplyr::filter( numb > 5, journal != "" )
pubPerJournal$journal <- forcats::fct_reorder(pubPerJournal$journal, pubPerJournal$numb, .desc=TRUE)
levels(pubPerJournal$journal) <-
  gsub("proceedings of the national academy of sciences", "pnas", levels( pubPerJournal$journal) )
levels(pubPerJournal$journal) <-
  gsub("the american journal of human genetics", "ajhg", levels( pubPerJournal$journal) )
pubPerJournal %>%
  ggplot( aes( journal, numb ) ) +
  geom_point() +
  theme(axis.text.x=element_text(angle=35, hjust=1)) +
  labs(y="Number of publications", x="Journal") +
  ylim(0, 29)
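The counting logic can be followed on a toy table: drop records without a journal, remove duplicated (journal, cid) pairs so a paper shared by two alumni is counted once, and tally the journal names case-insensitively. The sketch below uses base R and invented records.

```r
## Invented records: "c1" appears twice, as if two alumni co-authored it
toyPubs <- data.frame(
  journal = c("Nature", "nature", "PLOS ONE", "Nature", "bioRxiv", ""),
  cid     = c("c1", "c2", "c3", "c1", "c4", "c5"),
  stringsAsFactors = FALSE
)

## drop records without a journal, then de-duplicate (journal, cid) pairs
toyPubs <- unique( toyPubs[ toyPubs$journal != "", c("journal", "cid") ] )

## case-insensitive tally, most frequent journal first
counts <- sort( table( tolower(toyPubs$journal) ), decreasing=TRUE )
counts
```

Note that the de-duplication happens before lower-casing, as in the pipeline above, so "Nature" and "nature" are merged only at the tally step.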
It is interesting that LCG alumni appear in 27 bioRxiv preprints, the same number as Nature and PLOS ONE papers. A second group of journals comprises PNAS, Nucleic Acids Research and Nature Communications, with an average of 19 papers. Not surprisingly, the most frequent journals publish a lot of genomics research.
Number of publications of LCG alumni
Now is a good time to remember that our sample of profiles might be biased towards inflated publication numbers. Nevertheless, the histogram below shows the distribution of the number of publications per alumnus:
pubRecords <- dplyr::left_join( pubRecords, dat, by="GoogleScholarID" )
pubRecords %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( number=dplyr::n(), graduation=unique(Graduation) ) %>%
  ggplot(aes(number)) +
  geom_histogram( bins=30 ) +
  labs(x="Number of publications", y="Frequency")
The distribution above has a long tail. Obviously, the more time a person stays in research, the more publications that person will have. Thus, a more informative plot is one that also considers the date of graduation. To do so, I plotted the number of publications as a function of the years since graduation.
pubRecords %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( number=dplyr::n(), graduation=unique(Graduation) ) %>%
  dplyr::mutate( timeFromGrad=as.numeric(Sys.Date() - graduation) ) %>% ## days since graduation
  ggplot( aes( timeFromGrad/365.25, number ) ) +
  geom_point(alpha=0.5) +
  labs(x="Time since graduation (years)", y="Number of publications")
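The x-axis of this plot requires turning graduation dates into elapsed years. The following sketch does that conversion for two graduation dates taken from the tables in this post. It assumes 365.25 days per year and fixes the reference date to the date of this post so that the result is reproducible, whereas the analysis itself uses Sys.Date().

```r
## Two graduation dates that appear in the tables of this post
graduation <- as.Date( c("2007-09-14", "2011-08-01") )

## Fixed reference date (assumption for reproducibility)
asOf <- as.Date( "2019-07-27" )

## elapsed days as a plain number, then converted to years
days <- as.numeric( difftime( asOf, graduation, units="days" ) )
yearsFromGrad <- days / 365.25
round( yearsFromGrad, 2 )
```

Converting the difftime to a plain number before dividing also keeps ggplot2 from having to guess a scale for a difftime column.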
There seems to be quite some variability. Some people have many papers! The variability in the plot above could reflect differences in the collaborative environment between fields or institutions. For example, people working in highly collaborative settings, such as large consortia, will have more papers. Who are the alumni with the highest number of publications? Below is the top 10 ranking, which shows that the clear outlier in the plots above is Claudia Gonzaga-Jauregui from the first LCG class, who has co-authored 68 papers.
pubRecords %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( number=dplyr::n(), graduation=unique(Graduation), name=unique(Name) ) %>%
  dplyr::arrange( desc(number) ) %>%
  head(10)
## # A tibble: 10 x 4
## GoogleScholarID number graduation name
## <chr> <int> <date> <chr>
## 1 YMcmOsAAAAAJ 68 2007-11-28 Claudia Gabriela Gonzaga Jáuregui
## 2 nQuXihQAAAAJ 43 2009-07-02 María del Carmen Avila Arcos
## 3 Mbic02QAAAAJ 41 2011-08-01 Gabriel Cuellar Partida
## 4 Zkyg60AAAAAJ 39 2007-10-17 Gabriela Angélica Martínez Nava
## 5 h57-MykAAAAJ 37 2009-06-03 Leonardo Collado Torres
## 6 WIgmpAMAAAAJ 34 2008-06-25 Miguel Enrique Rentería Rodríguez
## 7 aXchdQQAAAAJ 32 2009-06-03 María Gutiérrez Arcelus
## 8 7cF2WKQAAAAJ 30 2007-10-01 Alejandra Eugenia Medina Rivera
## 9 sqH-GCQAAAAJ 29 2009-01-09 Angélica Paola Hernández Pérez
## 10 3khb6PYAAAAJ 28 2011-10-06 José Víctor Moreno Mayar
Fifth generation compared to others
I have the feeling that my class, also known as La Quinta, has done better in publishing compared to other classes. In order to test this hypothesis, I kept only those alumni who graduated before the first alumnus from the 5th class, together with the 5th class itself. To account for the time each person has spent in research, I compared the number of publications within the first 7 years after graduation. If we subject my feeling to hypothesis testing, we see that there is no significant difference in the number of publications between alumni from La Quinta and other classes. My feeling seems to be only that: a feeling.
pubRecords$year <- as.Date(paste0( pubRecords$year, "-01-01"))
## note: max() picks the latest graduation date within the 5th class;
## min() would pick the first one, as described in the text
firstQuinto <- max(dat$Graduation[which(dat$Generation == 5)])
pubRecords %>%
  dplyr::filter( Graduation - firstQuinto < 0 | Generation == 5, year - Graduation < 365*7) %>%
  dplyr::mutate( Generation=ifelse(Generation != 5 | is.na(Generation), "1st-4th", "5th") ) %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( numbs=dplyr::n(), Generation=unique(Generation) ) %>%
  wilcox.test( numbs ~ Generation, data=. )
##
## Wilcoxon rank sum test with continuity correction
##
## data: numbs by Generation
## W = 365.5, p-value = 0.5042
## alternative hypothesis: true location shift is not equal to 0
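For readers unfamiliar with the test, here is a sketch of the same comparison on simulated data: one publication count per alumnus and a two-level grouping variable. The counts are drawn from the same Poisson distribution for both groups, so they are not the real data and the resulting p-value says nothing about LCG classes.

```r
set.seed(1)

## Simulated per-alumnus publication counts (NOT the real data)
toyCounts <- data.frame(
  numbs      = c( rpois(12, 10), rpois(8, 10) ),
  Generation = rep( c("1st-4th", "5th"), times=c(12, 8) )
)

## Two-sided Wilcoxon rank sum test; exact=FALSE uses the normal
## approximation, which is what R falls back to when counts are tied
res <- wilcox.test( numbs ~ Generation, data=toyCounts, exact=FALSE )
res$statistic
res$p.value
```

A rank-based test is a reasonable choice here because publication counts are skewed and the groups are small.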
Impact of research as measured by citation numbers
One way to assess the impact of a publication is by the number of citations it receives in the literature. Nevertheless, people have argued that the most impactful research is not reflected in the number of citations [1], and citation rates vary from field to field. For example, papers in human genomics are cited more frequently than plant genomics papers. Anyway, if we explore the distribution of citations per paper, we observe the typical distribution of citation counts, which has been the subject of some statistics papers [2].
citHistory <- dplyr::left_join( citHistory, dat, by="GoogleScholarID" )
citHistory$year <- as.Date(paste0( citHistory$year, "-12-31"))
hist( pubRecords$cites, 100, xlab="Number of citations", main="" )
Below, I show the most cited papers co-authored by LCG alumni. The first two are large collaborative papers: one from the International HapMap 3 Consortium and a review giving an overview of the Bioconductor project. There is also a Science paper co-authored by two LCG alumni, María Avila and Victor Moreno-Mayar, who have become leaders in the field of ancient DNA.
pubRecords %>%
  dplyr::arrange( desc(cites) ) %>%
  dplyr::select( title, journal, Name, cites ) %>%
  as.data.frame() %>%
  head(10)
## title
## 1 Integrating common and rare genetic variation in diverse human populations
## 2 Orchestrating high-throughput genomic analysis with Bioconductor
## 3 Defining the core Arabidopsis thaliana root microbiome
## 4 Transcriptome genetics using second generation sequencing in a Caucasian population
## 5 Detecting differential usage of exons from RNA-seq data
## 6 Whole-genome sequencing in a patient with Charcot–Marie–Tooth neuropathy
## 7 Common regulatory variation impacts gene expression in a cell type–dependent manner
## 8 Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library
## 9 An Aboriginal Australian genome reveals separate human dispersals into Asia
## 10 An Aboriginal Australian genome reveals separate human dispersals into Asia
## journal Name
## 1 Nature Claudia Gabriela Gonzaga Jáuregui
## 2 Nature methods Alejandro Reyes Quiroz
## 3 Nature Sur Herrera Paredes
## 4 Nature María Gutiérrez Arcelus
## 5 Genome Research Alejandro Reyes Quiroz
## 6 New England Journal of Medicine Claudia Gabriela Gonzaga Jáuregui
## 7 Science María Gutiérrez Arcelus
## 8 Nature biotechnology Martín Del Castillo Velasco Herrera
## 9 Science María del Carmen Avila Arcos
## 10 Science José Víctor Moreno Mayar
## cites
## 1 2171
## 2 1259
## 3 1124
## 4 816
## 5 800
## 6 767
## 7 731
## 8 645
## 9 587
## 10 587
If we aggregate the number of citations per alumnus, the 10 most cited alumni are listed below.
pubPerPerson <- pubRecords %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( numPublications=dplyr::n() )
citRates <- citHistory %>%
  dplyr::group_by( GoogleScholarID, Name ) %>%
  dplyr::summarize( numCitations=sum( cites ) )
citRates %>%
  dplyr::arrange( desc(numCitations) ) %>%
  head(10)
## # A tibble: 10 x 3
## # Groups: GoogleScholarID [10]
## GoogleScholarID Name numCitations
## <chr> <chr> <dbl>
## 1 YMcmOsAAAAAJ Claudia Gabriela Gonzaga Jáuregui 5402
## 2 aXchdQQAAAAJ María Gutiérrez Arcelus 3899
## 3 8QLuIWgAAAAJ Alejandro Reyes Quiroz 2899
## 4 WIgmpAMAAAAJ Miguel Enrique Rentería Rodríguez 2567
## 5 nQuXihQAAAAJ María del Carmen Avila Arcos 2535
## 6 7cF2WKQAAAAJ Alejandra Eugenia Medina Rivera 2025
## 7 3khb6PYAAAAJ José Víctor Moreno Mayar 2013
## 8 szU4z5UAAAAJ Sur Herrera Paredes 1797
## 9 Mbic02QAAAAJ Gabriel Cuellar Partida 1753
## 10 _S26cFEAAAAJ Fernando Alberto Rabanal Mora 1194
Another way of doing this ranking is to consider the average number of citations per publication for each alumnus (i.e. dividing the number of citations by the number of publications). This operation is done by the code below. It turns out that I rank in first place. This is because I co-authored the Bioconductor review mentioned above and a paper describing a statistical method that has been broadly used by the RNA-seq community. This ranking also brings Jaime A Castro-Mondragon into the top 10. Although he might have fewer papers than others in top ranking positions (because he is much younger than the rest!), his work has been highly cited.
dplyr::left_join( citRates, pubPerPerson, by="GoogleScholarID" ) %>%
  dplyr::mutate( rateCitations = numCitations/numPublications ) %>%
  dplyr::arrange( desc(rateCitations) ) %>%
  dplyr::filter( numPublications > 3 ) %>%
  dplyr::ungroup() %>%
  dplyr::select(Name, numPublications, numCitations, rateCitations) %>%
  head(10)
## # A tibble: 10 x 4
## Name numPublications numCitations rateCitations
## <chr> <int> <dbl> <dbl>
## 1 Alejandro Reyes Quiroz 15 2899 193.
## 2 María Gutiérrez Arcelus 32 3899 122.
## 3 Jaime Abraham Castro Mondrag… 11 944 85.8
## 4 Fernando Alberto Rabanal Mora 15 1194 79.6
## 5 Mario Sandoval Calderón 8 636 79.5
## 6 Claudia Gabriela Gonzaga Jáu… 68 5402 79.4
## 7 Jorge Omar Yáñez Cuna 13 1009 77.6
## 8 Miguel Enrique Rentería Rodr… 34 2567 75.5
## 9 José Víctor Moreno Mayar 28 2013 71.9
## 10 Sur Herrera Paredes 25 1797 71.9
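The rate column is a plain ratio, which can be checked by hand. This sketch recomputes it for three rows, using the publication and citation counts from the table above and the Google Scholar identifiers listed in the earlier citation ranking.

```r
## Counts copied from the tables in this post
top <- data.frame(
  GoogleScholarID = c("8QLuIWgAAAAJ", "aXchdQQAAAAJ", "YMcmOsAAAAAJ"),
  numPublications = c(15, 32, 68),
  numCitations    = c(2899, 3899, 5402),
  stringsAsFactors = FALSE
)

## average citations per publication
top$rateCitations <- top$numCitations / top$numPublications
round( top$rateCitations, 1 )
## 193.3 121.8  79.4
```

Because it is a mean, this rate is sensitive to one or two very highly cited papers, which is exactly what drives the first position in the ranking.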
Finally, I am plotting the cumulative number of citations per alumnus as a function of the time since graduation from the LCG. Each line is one alumnus and the red line is a loess regression curve. As expected, the number of citations increases with time.
pr <- citHistory %>%
  dplyr::mutate( timeFromGrad=as.numeric(year-Graduation) ) %>% ## days since graduation
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::mutate( cumCites=cumsum(cites) )
loessFit <- loess(cumCites~timeFromGrad, data=pr)
pr %>%
  ggplot( aes( timeFromGrad/365.25, cumCites ) ) +
  geom_point(alpha=0.6) +
  geom_smooth(col="red", fill="red") +
  geom_line( aes(group=GoogleScholarID), alpha=0.2 ) +
  scale_y_sqrt( breaks=c(20, 100, 500, 1000, 2000, 3500, 5000),
                minor_breaks=seq(100, 500, 100) ) +
  labs(y="# of citations (cumulative)", x="Years from graduation from LCG")
According to the loess regression, the average alumnus has around 303 citations 7 years after graduation.
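A number like this can be read off a fitted loess curve with predict(). The sketch below does so on simulated data, so the value it returns is unrelated to the estimate quoted above. Expressing 7 years as 365*7 days is my assumption, chosen to match the day-based time variable in the code.

```r
set.seed(42)

## Simulated yearly citation counts for 20 alumni over 10 years (NOT real data)
toy <- expand.grid( alumnus=1:20, yearIdx=0:9 )
toy$timeFromGrad <- toy$yearIdx * 365 + runif( nrow(toy), 0, 300 ) ## in days
toy$cites <- rpois( nrow(toy), lambda=5 + 4 * toy$yearIdx )

## cumulative citations within each alumnus, in chronological order
toy <- toy[ order( toy$alumnus, toy$timeFromGrad ), ]
toy$cumCites <- ave( toy$cites, toy$alumnus, FUN=cumsum )

## fit the smooth curve and evaluate it 7 years after graduation
toyFit <- loess( cumCites ~ timeFromGrad, data=toy )
atSevenYears <- predict( toyFit, newdata=data.frame( timeFromGrad=365*7 ) )
atSevenYears
```

predict() on a loess fit returns NA outside the range of the observed times, so the evaluation point has to lie within the data.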
Conclusions
LCG’s impact on academic research is demonstrated by 938 manuscripts co-authored by alumni from the first six graduating classes. Most of these papers are published in biology journals. Altogether, these 938 papers have a total of 40995 citations and the median number of citations is 9. Note that the total number of papers and citations might be higher, since I only collected data from the first six classes and I was able to find Google Scholar profiles for less than 50% of the alumni that I searched for.
At LCG, we were often told that we are better than other undergrad programs. As a scientist I like to support my statements with data: although the publication numbers of LCG alumni might sound excellent, this analysis does not show that LCG alumni are better researchers than alumni from other undergrad programs. In order to make this comparison, we would need data from graduates of other programs, normalized for differences between fields (for example, math papers are less cited than biology papers).
However, it is evident that genomics research has benefited from the creation of LCG and that investment in more research-focused undergrad programs will benefit academic research.
Considerations and caveats of this analysis
- I am considering the number of publications and citations to measure impact. These estimates don’t take into account other activities that are impactful, such as teaching, outreach and science communication.
- Talking about impactful researchers, I asked Leo Collado-Torres for feedback on this analysis. Among other great comments, he pointed out that Google Scholar also collects records that are not necessarily scientific publications. Google Scholar records can include patents, conference abstracts, theses, software manuals, etc. If a profile is not well curated, it could include duplicated entries (for example, two versions of the same manuscript, one posted on bioRxiv and the published version). This could affect the number of publications I describe throughout the analysis. He also suggested analyzing h-indexes and networks of both co-authors and citations: this will come in version 2 of this analysis.
- This analysis does not distinguish the degree of contributions to publications (e.g. first authorship, etc). It would be interesting to incorporate this information into the analysis.
- As mentioned before, these numbers are taken from alumni that have Google scholar profiles. As people with more publications will tend to have Google scholar profiles, the averages presented in this analysis are likely to be overestimated.
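As a pointer for the h-index idea mentioned in the caveats, here is a minimal sketch of how it could be computed from a vector of per-paper citation counts. hIndex is a hypothetical helper written for this illustration, it is not part of the analysis above, and the input vector is invented.

```r
## Hypothetical helper: the h-index is the largest h such that
## at least h papers have h or more citations each
hIndex <- function( cites ){
  s <- sort( cites, decreasing=TRUE ) ## sort() drops missing values
  sum( s >= seq_along(s) )
}

hIndex( c(10, 8, 5, 4, 3) )
## 4
```

Applied per profile, for example with tapply( pubRecords$cites, pubRecords$GoogleScholarID, hIndex ), this would be affected by the duplicated entries described above just as the raw publication counts are.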
sessionInfo()
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cowplot_1.0.0 ggplot2_3.2.1 magrittr_1.5
## [4] scholar_0.1.7 googlesheets_0.3.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 cellranger_1.1.0 pillar_1.4.2
## [4] compiler_3.6.0 forcats_0.4.0 R.utils_2.9.0
## [7] R.methodsS3_1.7.1 tools_3.6.0 zeallot_0.1.0
## [10] digest_0.6.20 gtable_0.3.0 jsonlite_1.6
## [13] evaluate_0.14 tibble_2.1.3 R.cache_0.13.0
## [16] pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0
## [19] curl_4.0 yaml_2.2.0 xfun_0.9
## [22] withr_2.1.2 dplyr_0.8.3 httr_1.4.1
## [25] stringr_1.4.0 knitr_1.24 xml2_1.2.2
## [28] vctrs_0.2.0 askpass_1.1 hms_0.5.1
## [31] grid_3.6.0 tidyselect_0.2.5 glue_1.3.1
## [34] R6_2.4.0 fansi_0.4.0 rmarkdown_1.15
## [37] purrr_0.3.2 readr_1.3.1 ellipsis_0.2.0.1
## [40] scales_1.0.0 backports_1.1.4 htmltools_0.3.6
## [43] rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1
## [46] labeling_0.3 utf8_1.1.4 stringi_1.4.3
## [49] lazyeval_0.2.2 munsell_0.5.0 openssl_1.4.1
## [52] crayon_1.3.4 R.oo_1.22.0