Impact Of Genomics Undergrad


Impact of UNAM’s undergrad in Genomic Sciences on academic research

UNAM’s undergraduate program in Genomic Sciences (LCG, for its initials in Spanish) was created to give students the background they need to do genomics research. The program has very strict selection criteria, involving two exams and one interview. Classes are small (20 to 30 people) compared to other UNAM programs. The first students graduated in August of 2009. I know that most alumni continue their education and go on to obtain master's and PhD degrees. Some of them enter prestigious PhD programs both in Mexico and abroad. Perhaps the most accomplished alumni, academically speaking, are the 5 to 10 graduates who now lead independent research groups. I think most of these research groups are based at UNAM Juriquilla, Mexico. Others have accomplished a lot in their careers outside of academia. For example, Mariana Matus is the CEO and co-founder of a start-up company in Boston.

In any case, the objective of this document is to assess the impact that LCG has had on academic research, by calculating the number of publications from LCG alumni as well as the impact of these publications.

The data

List of LCG alumni

In order to obtain publication and citation statistics for each LCG alumnus, I generated a Google spreadsheet that contains every student who had graduated by August 23rd, 2019. The list of graduates, along with the date of graduation, was obtained from the LCG’s website. I manually filled the columns of the spreadsheet with Google Scholar identifiers, and I filled in the class to which each person belongs when I could remember it. I collected data for the first six classes that graduated from LCG. Then I got bored/tired and stopped. If anyone is interested in helping fill the spreadsheet further, please let me know. For now, the code below reads this compiled spreadsheet into R.

library(googlesheets)
gap <- gs_title( "GoogleGenomicos" ) ## you need to have this spreadsheet in your google drive and be logged in
dat <- gs_read( gap )
head( dat )
## # A tibble: 6 x 4
##   Name                              Graduation Generation GoogleScholarID
##   <chr>                             <date>          <dbl> <chr>          
## 1 Rocío Domínguez Vidaña            2007-09-14         NA C87y0MMAAAAJ   
## 2 Lucía Guadalupe Morales Reyes     2007-09-14         NA k6GlzNUAAAAJ   
## 3 Selene Lizbeth Fernández Valverde 2007-09-24         NA iHnkhAgAAAAJ   
## 4 Alejandra Eugenia Medina Rivera   2007-10-01         NA 7cF2WKQAAAAJ   
## 5 Estefanía García Ruiz             2007-10-04         NA <NA>           
## 6 Santiago Sandoval Motta           2007-10-04         NA 1Ud8j-UAAAAJ

Out of the data I compiled, I could find 73 profiles, which correspond to 26.74% of current graduates and to roughly 47.71% of the students from the first six classes. Profiles are most probably not missing at random, as people with more publications are more likely to have a Google Scholar profile. This could result in inflated estimates, and it is something to consider when reaching conclusions.
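The coverage figures above are simple proportions of non-missing identifiers. A minimal sketch of the calculation, using a made-up table (the names and values in `toy` are hypothetical and are not the real spreadsheet):

```r
# Toy stand-in for the alumni spreadsheet; all values are invented.
toy <- data.frame(
  Name = c("A", "B", "C", "D"),
  GoogleScholarID = c("id1", NA, "id2", NA),
  stringsAsFactors = FALSE
)

# Number of alumni with a Google Scholar profile.
nProfiles <- sum(!is.na(toy$GoogleScholarID))

# Percentage of alumni with a profile, rounded to two decimals.
pctProfiles <- round(100 * mean(!is.na(toy$GoogleScholarID)), 2)

nProfiles
pctProfiles
```

The same two lines applied to `dat$GoogleScholarID` would give the overall coverage; restricting to the first six classes would need an extra filter on graduation date or class.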

Publication numbers and citations

Having collected the Google Scholar identifiers for LCG alumni, I am using the CRAN package scholar to scrape the data for each Google Scholar profile. For each profile, I obtain the citation history and the data for each of the publications. Between queries, I introduce Sys.sleep commands whose waiting times (in seconds) are sampled from normal distributions, in order to avoid captchas.

library(scholar)

googleIDs <- dat$GoogleScholarID
googleIDs <- googleIDs[!is.na( googleIDs )]

scholarDataFile <- "scholarData.rds"
if( !file.exists(scholarDataFile) ){
  scholarData <- lapply( googleIDs, function(x){
    pubs <- try( get_publications(x) )
    Sys.sleep( abs( rnorm(1, 10, 5) ) ) ## randomize waiting times to avoid captchas
    cits <- try( get_citation_history(x) )
    Sys.sleep( abs( rnorm(1, 60, 20) ) )
    list( pubRecord=pubs,
          citHistory=cits )
  } )
  names( scholarData ) <- googleIDs
  saveRDS( scholarData, file=scholarDataFile )
}
scholarData <- readRDS(scholarDataFile)
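Because `abs( rnorm() )` can occasionally return a value close to zero, some queries may fire almost back to back. One possible variant, sketched here with a hypothetical helper `randomWait()` that is not part of the scholar package, enforces a minimum waiting time:

```r
# Hypothetical helper (not from any package): draw a normally distributed
# waiting time in seconds and never return less than `minWait`.
randomWait <- function(mean, sd, minWait = 1) {
  max(minWait, rnorm(1, mean, sd))
}

set.seed(1)
waits <- replicate(1000, randomWait(10, 5, minWait = 2))
range(waits)  # the lower end is never below minWait

# Inside the scraping loop this could be used as:
# Sys.sleep( randomWait(10, 5) )
```

Whether a floor is needed in practice is a judgment call; it only matters for the rare draws near zero.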

Then, I do some data wrangling to convert them to long-formatted data frames and start exploring these data.

pubRecords <- lapply( scholarData, "[[", "pubRecord" )
citHistory <- lapply( scholarData, "[[", "citHistory" )

stopifnot(all(vapply(pubRecords, class, character(1)) == "data.frame"))
stopifnot(all(vapply(citHistory, class, character(1)) == "data.frame"))


pubRecords <- purrr::map_df( pubRecords, ~as.data.frame(.x), .id="GoogleScholarID")
citHistory <- purrr::map_df( citHistory, ~as.data.frame(.x), .id="GoogleScholarID")
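To illustrate what the `.id` argument does in this step, here is a small sketch with an invented list (the identifiers `idA` and `idB` and the contents of `toyList` are made up): the list names end up in a new column, so each row keeps track of the profile it came from.

```r
# Toy named list of data frames; the names play the role of Google Scholar IDs.
toyList <- list(
  idA = data.frame(title = c("paper 1", "paper 2"), cites = c(10, 3)),
  idB = data.frame(title = "paper 3", cites = 7)
)

# Row-bind the list; the list names are stored in the "GoogleScholarID" column.
toyLong <- purrr::map_df(toyList, ~as.data.frame(.x), .id = "GoogleScholarID")
toyLong
```

For a list that already contains data frames, `dplyr::bind_rows(toyList, .id = "GoogleScholarID")` should give the same long table.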

Publication records

Where do LCG alumni publish?

The code below computes the number of articles published in each journal and then sorts the journals by the number of publications co-authored by LCG alumni. The plot shows the journals that appear most frequently (more than five times) among the publication profiles of LCG alumni.

library(magrittr)
library(ggplot2)
library(cowplot)

pubRecords <- pubRecords %>%
  dplyr::filter(journal != "")

theme_set(theme_cowplot())
pubPerJournal <- pubRecords %>%
  dplyr::select( journal, cid ) %>%
  unique() %>%
  dplyr::group_by( journal=tolower(journal) ) %>%
  dplyr::summarise( numb=dplyr::n() ) %>%
  dplyr::arrange( desc(numb) ) %>%
  dplyr::filter( numb > 5, journal != "" )

pubPerJournal$journal <- forcats::fct_reorder(pubPerJournal$journal, pubPerJournal$numb, .desc=TRUE)

levels(pubPerJournal$journal) <-
  gsub("proceedings of the national academy of sciences", "pnas", levels( pubPerJournal$journal) )
levels(pubPerJournal$journal) <-
  gsub("the american journal of human genetics", "ajhg", levels( pubPerJournal$journal) )

pubPerJournal %>%
  ggplot( aes( journal, numb ) ) +
  geom_point() +
  theme(axis.text.x=element_text(angle=35, hjust=1)) +
  labs(y="Number of publications", x="Journal") +
  ylim(0, 29)
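The counting step can be seen more easily on a tiny invented table (the `toyPubs` values are hypothetical). In this sketch, a paper that shows up in two profiles with the same `cid` is counted once, and journal names are lower-cased so that spelling variants such as "Nature" and "nature" are pooled:

```r
library(magrittr)

# Invented publication records; the last row repeats the first one,
# as would happen when two alumni co-author the same paper.
toyPubs <- data.frame(
  journal = c("Nature", "nature", "PLOS ONE", "Nature"),
  cid = c("c1", "c2", "c3", "c1"),
  stringsAsFactors = FALSE
)

toyCounts <- toyPubs %>%
  dplyr::distinct(journal, cid) %>%                       # drop repeated papers
  dplyr::count(journal = tolower(journal), name = "numb", sort = TRUE)
toyCounts
```

Here `dplyr::count()` is used as a shorthand for the `group_by()`, `summarise()` and `arrange()` chain.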

It is interesting that LCG alumni appear in 27 bioRxiv preprints, the same number as Nature and PLOS ONE papers. As a second group of journals, we have PNAS, Nucleic Acids Research and Nature Communications, with an average of 19 papers. Not surprisingly, the most frequent journals publish a lot of research in genomics.

Number of publications of LCG alumni

Now is a good time to remember that our sample of profiles might be biased towards inflated publication numbers. Nevertheless, the histogram below shows the distribution of the number of publications per alumnus:

pubRecords <- dplyr::left_join( pubRecords, dat )

pubRecords %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( number=dplyr::n(), graduation=unique(Graduation) ) %>%
  ggplot(aes(number)) +
  geom_histogram( bins=30 ) +
  labs(x="Number of publications", y="Frequency")

The distribution above has a long tail. Obviously, the more time a person stays in research, the more publications that person will have. Thus, a more informative plot is one that also considers the date of graduation. To do so, I plotted the number of publications as a function of the years since graduation.

pubRecords %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( number=dplyr::n(), graduation=unique(Graduation) ) %>%
  dplyr::mutate( timeFromGrad=as.numeric( Sys.Date() - graduation, units="days" )/365.25 ) %>% ## elapsed time in years
  ggplot( aes( timeFromGrad, number ) ) +
  geom_point(alpha=0.5) +
  labs(x="Time since graduation (years)", y="Number of publications")
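Subtracting two dates in R gives a difftime in days, so expressing it in years requires choosing a year length. A small sketch with made-up dates compares a 360-day year (30-day months times 12) with the average calendar year of 365.25 days:

```r
graduation <- as.Date("2009-06-03")
today <- as.Date("2019-08-23")   # fixed date so the example is reproducible

elapsed <- today - graduation                      # difftime, in days
elapsedDays <- as.numeric(elapsed, units = "days")

years360 <- elapsedDays / 30 / 12   # 30-day months, 12 months: a 360-day year
years365 <- elapsedDays / 365.25    # average calendar year, leap years included
c(years360, years365)
```

The 360-day version overstates the elapsed time by about 1.5%, which is small but grows with the number of years.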

There seems to be quite some variability. There are people who have many papers! The variability in the plot above could reflect differences in the collaborative environment between fields or institutions. For example, people working in more collaborative settings, such as consortia, will have more papers. Who are the alumni with the highest number of publications? Below is the top 10 ranking, which shows that the clear outlier in the plots above corresponds to Claudia Gonzaga-Jauregui from the first LCG class, who has co-authored 68 papers.

pubRecords %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( number=dplyr::n(), graduation=unique(Graduation), name=unique(Name) ) %>%
  dplyr::arrange( desc(number) ) %>%
  head(10)
## # A tibble: 10 x 4
##    GoogleScholarID number graduation name                             
##    <chr>            <int> <date>     <chr>                            
##  1 YMcmOsAAAAAJ        68 2007-11-28 Claudia Gabriela Gonzaga Jáuregui
##  2 nQuXihQAAAAJ        43 2009-07-02 María del Carmen Avila Arcos     
##  3 Mbic02QAAAAJ        41 2011-08-01 Gabriel Cuellar Partida          
##  4 Zkyg60AAAAAJ        39 2007-10-17 Gabriela Angélica Martínez Nava  
##  5 h57-MykAAAAJ        37 2009-06-03 Leonardo Collado Torres          
##  6 WIgmpAMAAAAJ        34 2008-06-25 Miguel Enrique Rentería Rodríguez
##  7 aXchdQQAAAAJ        32 2009-06-03 María Gutiérrez Arcelus          
##  8 7cF2WKQAAAAJ        30 2007-10-01 Alejandra Eugenia Medina Rivera  
##  9 sqH-GCQAAAAJ        29 2009-01-09 Angélica Paola Hernández Pérez   
## 10 3khb6PYAAAAJ        28 2011-10-06 José Víctor Moreno Mayar

Fifth generation compared to others

I have the feeling that my class, also known as La Quinta, has done better at publishing than other classes. To test this hypothesis, I filtered the data to keep only those alumni who graduated before the first alumni from the 5th class, together with the 5th class itself. To account for the time each person has spent in research, I compared the number of publications up to 7 years after graduation. If we subject my feeling to hypothesis testing with a Wilcoxon rank-sum test, we see that there is no significant difference in the number of publications between alumni from La Quinta and other classes. My feeling seems to be only that: a feeling.

pubRecords$year <- as.Date(paste0( pubRecords$year, "-01-01"))

## max() returns the latest graduation date among alumni of the 5th class
firstQuinto <- max(dat$Graduation[which(dat$Generation == 5)])

pubRecords %>%
  dplyr::filter( Graduation - firstQuinto < 0 | Generation == 5, year - Graduation < 365*7) %>%
  dplyr::mutate( Generation=ifelse(Generation != 5 | is.na(Generation), "1st-4th", "5th") ) %>%
  dplyr::group_by( GoogleScholarID ) %>%
  dplyr::summarize( numbs=dplyr::n(), Generation=unique(Generation) ) %>%
  wilcox.test( numbs ~ Generation, data=. )
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  numbs by Generation
## W = 365.5, p-value = 0.5042
## alternative hypothesis: true location shift is not equal to 0
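The Wilcoxon rank-sum test compares two groups using ranks, so it does not assume that publication counts are normally distributed. A minimal sketch with invented counts (the values in `toyTest` are hypothetical) shows how to pull the test statistic and the p-value out of the result:

```r
# Invented publication counts per alumnus for two groups.
toyTest <- data.frame(
  numbs = c(3, 8, 5, 12, 7, 4, 9, 6),
  Generation = rep(c("1st-4th", "5th"), each = 4)
)

res <- wilcox.test(numbs ~ Generation, data = toyTest)
res$statistic  # W statistic
res$p.value    # two-sided p-value
```

A p-value above the chosen significance level (0.05 is a common choice) means the data give no evidence of a shift in location between the two groups.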

Impact of research as measured by citation numbers

One way to assess the impact of a publication is by the number of citations it receives in the literature. Nevertheless, people have argued that the most impactful research is not reflected in the number of citations [1], and citation rates vary from field to field. For example, papers in human genomics are cited more frequently than plant genomics papers. Anyway, if we explore the distribution of citations per paper, we observe the typical distribution of citation counts, which has been the subject of some statistics papers [2].

citHistory <- dplyr::left_join( citHistory, dat )
citHistory$year <- as.Date(paste0( citHistory$year, "-12-31"))
hist( pubRecords$cites, 100, xlab="Number of citations", main="" )