COG0006 6 3 0 0 2 1 Xaa-Pro aminopeptidase
COG0007 1 0 0 1 0 0 uroporphyrin-III C-methyltransferase
COG0008 5 2 0 1 0 2 glutamyl-tRNA synthetase (EC 6.1.1.17)
COG0009 5 2 1 0 1 1 translation factor SUA5
COG0010 1 0 0 0 1 0 arginase (EC 3.5.3.1)
. . .

and create a data frame with the first column, the second column, and the next N numeric columns after that, where N is specified by the user. Label the columns in the data frame "contig", "total", and P1, P2, ..., PN.
Solution: We use readLines() to read the file into a character vector, one string per line. This is fed into strsplit(), which splits each string into a vector of whitespace-separated fields. The first field of each vector is extracted by sapply()-ing a subscript function, and the numeric columns are extracted separately and converted to numbers with as.numeric() via another sapply(). Because sapply() returns one column per line, the result is transposed with t() so that each line becomes a row. The pieces are joined together with data.frame(), whose columns can be easily renamed with names().
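To make the steps concrete, here is a small sketch that runs the same pipeline on two of the sample lines held in memory rather than read from a file. The variable names (lines, fields, contig, counts) are illustrative assumptions, and nparts is set to 5 only because the sample rows have five per-part counts. The trailing description text is simply ignored, since only the first nparts+2 fields are used.

```r
# Two sample records, standing in for the result of readLines( filename )
lines <- c( "COG0006 6 3 0 0 2 1 Xaa-Pro aminopeptidase",
            "COG0007 1 0 0 1 0 0 uroporphyrin-III C-methyltransferase" )
nparts <- 5   # number of per-part count columns (assumed for this sample)

# strsplit() returns a list with one character vector per input line
fields <- strsplit( lines, " +" )

# First field of each record
contig <- sapply( fields, function( r ) r[1] )

# Fields 2..(nparts+2) hold the total and the nparts counts. sapply() returns
# one column per record, so t() turns the records back into rows.
counts <- t( sapply( fields, function( r ) as.numeric( r[2:(nparts+2)] ) ) )

d <- data.frame( contig = contig, counts )
names( d ) <- c( "contig", "total", paste( "P", 1:nparts, sep="" ) )
print( d )
```

Printing fields or counts at each stage is a convenient way to see what strsplit() and sapply() hand back before the data frame is assembled.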
# c5 <- read_gene_part_summary_file( "cogs.5.out", 5 )
read_gene_part_summary_file <- function( filename, nparts ) {
  # One character vector of whitespace-separated fields per line
  str_v <- strsplit( readLines( filename ), " +" )
  d <- data.frame(
    contig = sapply( str_v, function( r ) r[1] ),
    # Fields 2..(nparts+2) are the total plus the nparts counts
    t( sapply( str_v, function( r ) as.numeric( r[2:(nparts+2)] ) ) )
  )
  names( d ) <- c( "contig", "total", paste( "P", as.character(1:nparts), sep="" ) )
  d
}

Here's a more generalized version that will optionally read the column headings:

read_vtable <- function( filename, nparts, header=FALSE ) {
  str_v <- strsplit( readLines( filename ), " +" )
  # The data frame has nparts+2 columns (label, total, nparts counts),
  # so that many names are needed
  ncols <- nparts + 2
  if ( header ) {
    # The first line supplies the column names; the rest is data
    ns <- (str_v[[1]])[1:ncols]
    str_v <- str_v[-1]
  } else {
    ns <- paste( "V", as.character(1:ncols), sep="" )
  }
  print( ns )
  d <- data.frame(
    contig = sapply( str_v, function( r ) r[1] ),
    t( sapply( str_v, function( r ) as.numeric( r[2:(nparts+2)] ) ) )
  )
  names( d ) <- ns
  d
}
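One way to try the header option is to write a few lines to a temporary file and read them back. The sketch below repeats a compact copy of the function so that it runs on its own. The header names and file contents are made up for illustration, and this copy takes nparts+2 names so that every column is labelled, which is an assumption about the intended behaviour.

```r
# Compact copy of the header-aware reader, repeated so this sketch runs alone
read_vtable <- function( filename, nparts, header=FALSE ) {
  str_v <- strsplit( readLines( filename ), " +" )
  ncols <- nparts + 2                      # label + total + nparts counts
  if ( header ) {
    ns <- str_v[[1]][1:ncols]              # names come from the first line
    str_v <- str_v[-1]
  } else {
    ns <- paste( "V", 1:ncols, sep="" )    # fall back to V1, V2, ...
  }
  d <- data.frame( sapply( str_v, function( r ) r[1] ),
                   t( sapply( str_v, function( r ) as.numeric( r[2:ncols] ) ) ) )
  names( d ) <- ns
  d
}

# Hypothetical test data: two sample records, with and without a header line
records <- c( "COG0006 6 3 0 0 2 1 Xaa-Pro aminopeptidase",
              "COG0007 1 0 0 1 0 0 uroporphyrin-III C-methyltransferase" )
f_head <- tempfile()
f_plain <- tempfile()
writeLines( c( "contig total P1 P2 P3 P4 P5 description", records ), f_head )
writeLines( records, f_plain )

with_header <- read_vtable( f_head, 5, header=TRUE )
no_header <- read_vtable( f_plain, 5 )

print( with_header )
print( no_header )
unlink( c( f_head, f_plain ) )
```

With header=TRUE the first line is consumed as names and never reaches as.numeric(); without it, every line is treated as data and the columns are labelled V1, V2, and so on.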