dataframe - Page 3 - Developer IT

How to find subgroups statistics in pandas?

- by user2808117

I am grouping a DataFrame by multiple columns (e.g., columns A and B, using my_df.groupby(['A','B'])). Is there a better way (fewer lines of code, faster) to find how many rows are in each subgroup and how many subgroups there are in total? At the moment I am using:

    def get_grp_size(grp):
        grp['size'] = len(grp)
        return grp

    my_df = my_df.groupby(['A','B']).apply(get_grp_size)
    my_df[['A','B','size']].drop_duplicates().size
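One possible way to get both numbers without a custom apply function is sketched below. The sample data and the `value` column are made up for illustration; only the column names A and B come from the question. This is a minimal sketch rather than the only approach.

```python
import pandas as pd

# Hypothetical sample data; only the column names A and B come from the question.
my_df = pd.DataFrame({
    "A": ["x", "x", "x", "y", "y"],
    "B": [1, 1, 2, 1, 1],
    "value": [10, 20, 30, 40, 50],
})

grouped = my_df.groupby(["A", "B"])

# Number of rows in each (A, B) subgroup, as a Series indexed by (A, B).
group_sizes = grouped.size()

# Total number of distinct subgroups.
n_groups = grouped.ngroups

# Optional: attach the subgroup size to every row, as the original code did.
my_df["size"] = grouped["value"].transform("size")

print(group_sizes)
print(n_groups)
```

`size()` counts rows per group, `ngroups` counts the groups themselves, and `transform("size")` broadcasts the count back onto the original rows.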


pandas read rotated csv files

- by EricCoding

Is there any function in pandas that can directly read a rotated csv file? To be specific, the header information is in the first column instead of the first row. For example:

    A 1 2
    B 3 5
    C 6 7

and I would like the final DataFrame to look like this:

    A B C
    1 3 6
    2 5 7

Of course we can get around this problem using some data wrangling techniques like transposing and slicing. I suspect there should be a quick way in the API, but I could not find it.
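A minimal sketch of the transpose route follows, assuming the file is comma-separated (the excerpt shows spaces, so the delimiter here is a guess) and using an in-memory string in place of a real file.

```python
import io
import pandas as pd

# Hypothetical file contents; the comma delimiter is an assumption.
raw = "A,1,2\nB,3,5\nC,6,7\n"

# Read without a header row, use the first column as the index, then transpose
# so that the former row labels become the column names.
df = pd.read_csv(io.StringIO(raw), header=None, index_col=0).T

# Tidy up: renumber the rows from 0 and drop the leftover axis name.
df = df.reset_index(drop=True)
df.columns.name = None

print(df)
```

For a real file, the `io.StringIO(raw)` argument would be replaced by the file path.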


replace NA in an R vector with adjacent values

- by pssguy

I have a dataframe which has merged player and team data for soccer seasons. For a particular player in a specific season I have data like:

    df <- data.frame(team=c(NA,"CRP",NA,"CRP","CRP",NA),
                     player=c(NA,"Ed",NA,"Ed","Ed",NA),
                     playerGame= c(NA,1,NA,2,3,NA),
                     teamGame =c(1,2,3,4,5,6))

where the NAs indicate that the player did not appear in that specific team game. How would I most efficiently replace the team and player NAs with "CRP" and "Ed" respectively, and get a playerGame output of, in this instance, 0,1,1,2,3,3?


Python Pandas operate on row

- by wuha

Hi, my dataframe looks like:

    Store,Dept,Date,Sales
    1,1,2010-02-05,245
    1,1,2010-02-12,449
    1,1,2010-02-19,455
    1,1,2010-02-26,154
    1,1,2010-03-05,29
    1,1,2010-03-12,239
    1,1,2010-03-19,264

Simply put, I need to add another column called '_id' as the concatenation of Store, Dept, and Date, like "1_1_2010-02-05". I assumed I could do it through

    df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date']

but it turned out not to work. Similarly, I also need to add a new column with the log of sales. I tried

    df['logSales'] = math.log(df['Sales'])

and again, it did not work.
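Both attempts likely fail for type reasons: Store and Dept are integers, so they cannot be added to strings directly, and math.log expects a single number rather than a whole column. A minimal sketch of one way around this, using the first three rows of the data above and NumPy's vectorized log, is:

```python
import io
import numpy as np
import pandas as pd

# First rows of the data from the question, loaded from an in-memory string.
csv_text = """Store,Dept,Date,Sales
1,1,2010-02-05,245
1,1,2010-02-12,449
1,1,2010-02-19,455
"""
df = pd.read_csv(io.StringIO(csv_text))

# Convert the integer columns to strings before concatenating.
df["_id"] = df["Store"].astype(str) + "_" + df["Dept"].astype(str) + "_" + df["Date"]

# np.log works element-wise on a whole column; math.log only accepts scalars.
df["logSales"] = np.log(df["Sales"])

print(df)
```

The same pattern applies to any mix of numeric and string columns: cast first, then concatenate.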


In R: How do I choose the first 4 Rows of a Data Frame

- by Moe

How do I go about getting the first 4 rows of my data frame in R?

      Weight Response
    1 Control 59 0.0
    2 Treatment 90 0.8
    3 Treatment 47 0.1
    4 Treamment 106 0.1
    5 Control 85 0.7
    6 Treatment 73 0.6
    7 Control 61 0.2


Creating a series of vectors from a vector

- by bluetongue

I have a simple two vector dataframe (length=30) that looks something like this:

    > mDF
      Param1 w.IL.L
    1 AuZgFw    0.5
    2 AuZfFw      2
    3 AuZgVw   74.3
    4 AuZfVw  20.52
    5 AuTgIL   80.9
    6 AuTfIL  193.3
    7 AuCgFL    0.2
    8 ...

I'd like to use each of the rows to form 30 single value numeric vectors, with the name of each vector taken from mDF$Param1, so that:

    > AuZgFw
    [1] 0.5

etc. I've tried melting and casting, but I suspect there may be an easier way? Thanks in advance, BT


how to use ggplot conditional on data

- by Andreas

I asked this question and it seams ggplot2 currently has a bug with empty data.frames. Therefore I am trying to check if the dataframe is empty, before I make the plot. But what ever I come up with, it gets really ugly, and doesn't work. So I am asking for your help. example data: SOdata <- structure(list(id = 10:55, one = c(7L, 8L, 7L, NA, 7L, 8L, 5L, 7L, 7L, 8L, NA, 10L, 8L, NA, NA, NA, NA, 6L, 5L, 6L, 8L, 4L, 7L, 6L, 9L, 7L, 5L, 6L, 7L, 6L, 5L, 8L, 8L, 7L, 7L, 6L, 6L, 8L, 6L, 8L, 8L, 7L, 7L, 5L, 5L, 8L), two = c(7L, NA, 8L, NA, 10L, 10L, 8L, 9L, 4L, 10L, NA, 10L, 9L, NA, NA, NA, NA, 7L, 8L, 9L, 10L, 9L, 8L, 8L, 8L, 8L, 8L, 9L, 10L, 8L, 8L, 8L, 10L, 9L, 10L, 8L, 9L, 10L, 8L, 8L, 7L, 10L, 8L, 9L, 7L, 9L), three = c(7L, 10L, 7L, NA, 10L, 10L, NA, 10L, NA, NA, NA, NA, 10L, NA, NA, 4L, NA, 7L, 7L, 4L, 10L, 10L, 7L, 4L, 7L, NA, 10L, 4L, 7L, 7L, 7L, 10L, 10L, 7L, 10L, 4L, 10L, 10L, 10L, 4L, 10L, 10L, 10L, 10L, 7L, 10L), four = c(7L, 10L, 4L, NA, 10L, 7L, NA, 7L, NA, NA, NA, NA, 10L, NA, NA, 4L, NA, 10L, 10L, 7L, 10L, 10L, 7L, 7L, 7L, NA, 10L, 7L, 4L, 10L, 4L, 7L, 10L, 2L, 10L, 4L, 12L, 4L, 7L, 10L, 10L, 12L, 12L, 4L, 7L, 10L), five = c(7L, NA, 6L, NA, 8L, 8L, 7L, NA, 9L, NA, NA, NA, 9L, NA, NA, NA, NA, 7L, 8L, NA, NA, 7L, 7L, 4L, NA, NA, NA, NA, 5L, 6L, 5L, 7L, 7L, 6L, 9L, NA, 10L, 7L, 8L, 5L, 7L, 10L, 7L, 4L, 5L, 10L), six = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("2010-05-25", "2010-05-27", "2010-06-07"), class = "factor"), seven = c(0.777777777777778, 0.833333333333333, 0.333333333333333, 0.888888888888889, 0.5, 0.888888888888889, 0.777777777777778, 0.722222222222222, 0.277777777777778, 0.611111111111111, 0.722222222222222, 1, 0.888888888888889, 0.722222222222222, 0.555555555555556, NA, 0, 0.666666666666667, 0.666666666666667, 0.833333333333333, 0.833333333333333, 0.833333333333333, 0.833333333333333, 
0.722222222222222, 0.833333333333333, 0.888888888888889, 0.666666666666667, 1, 0.777777777777778, 0.722222222222222, 0.5, 0.833333333333333, 0.722222222222222, 0.388888888888889, 0.722222222222222, 1, 0.611111111111111, 0.777777777777778, 0.722222222222222, 0.944444444444444, 0.555555555555556, 0.666666666666667, 0.722222222222222, 0.444444444444444, 0.333333333333333, 0.777777777777778), eight = c(0.666666666666667, 0.333333333333333, 0.833333333333333, 0.666666666666667, 1, 1, 0.833333333333333, 0.166666666666667, 0.833333333333333, 0.833333333333333, 1, 1, 0.666666666666667, 0.666666666666667, 0.333333333333333, 0.5, 0, 0.666666666666667, 0.5, 1, 0.666666666666667, 0.5, 0.666666666666667, 0.666666666666667, 0.666666666666667, 0.333333333333333, 0.333333333333333, 1, 0.666666666666667, 0.833333333333333, 0.666666666666667, 0.666666666666667, 0.5, 0, 0.833333333333333, 1, 0.666666666666667, 0.5, 0.666666666666667, 0.666666666666667, 0.5, 1, 0.833333333333333, 0.666666666666667, 0.833333333333333, 0.666666666666667), nine = c(0.307692307692308, NA, 0.461538461538462, 0.538461538461538, 1, 0.769230769230769, 0.538461538461538, 0.692307692307692, 0, 0.153846153846154, 0.769230769230769, NA, 0.461538461538462, NA, NA, NA, NA, 0, 0.615384615384615, 0.615384615384615, 0.769230769230769, 0.384615384615385, 0.846153846153846, 0.923076923076923, 0.615384615384615, 0.692307692307692, 0.0769230769230769, 0.846153846153846, 0.384615384615385, 0.384615384615385, 0.461538461538462, 0.384615384615385, 0.461538461538462, NA, 0.923076923076923, 0.692307692307692, 0.615384615384615, 0.615384615384615, 0.769230769230769, 0.0769230769230769, 0.230769230769231, 0.692307692307692, 0.769230769230769, 0.230769230769231, 0.769230769230769, 0.615384615384615), ten = c(0.875, 0.625, 0.375, 0.75, 0.75, 0.75, 0.625, 0.875, 1, 0.125, 1, NA, 0.625, 0.75, 0.75, 0.375, NA, 0.625, 0.5, 0.75, 0.875, 0.625, 0.875, 0.75, 0.625, 0.875, 0.5, 0.75, 0, 0.5, 0.875, 1, 0.75, 0.125, 0.5, 0.5, 0.5, 0.625, 
0.375, 0.625, 0.625, 0.75, 0.875, 0.375, 0, 0.875), elleven = c(1, 0.8, 0.7, 0.9, 0, 1, 0.9, 0.5, 0, 0.8, 0.8, NA, 0.8, NA, NA, 0.8, NA, 0.4, 0.8, 0.5, 1, 0.4, 0.5, 0.9, 0.8, 1, 0.8, 0.5, 0.3, 0.9, 0.2, 1, 0.8, 0.1, 1, 0.8, 0.5, 0.2, 0.7, 0.8, 1, 0.9, 0.6, 0.8, 0.2, 1), twelve = c(0.666666666666667, NA, 0.133333333333333, 1, 1, 0.8, 0.4, 0.733333333333333, NA, 0.933333333333333, NA, NA, 0.6, 0.533333333333333, NA, 0.533333333333333, NA, 0, 0.6, 0.533333333333333, 0.733333333333333, 0.6, 0.733333333333333, 0.666666666666667, 0.533333333333333, 0.733333333333333, 0.466666666666667, 0.733333333333333, 1, 0.733333333333333, 0.666666666666667, 0.533333333333333, NA, 0.533333333333333, 0.6, 0.866666666666667, 0.466666666666667, 0.533333333333333, 0.333333333333333, 0.6, 0.6, 0.866666666666667, 0.666666666666667, 0.6, 0.6, 0.533333333333333)), .Names = c("id", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "elleven", "twelve"), class = "data.frame", row.names = c(NA, -46L)) And the plot iqr <- function(x, ...) { qs <- quantile(as.numeric(x), c(0.25, 0.5, 0.75), na.rm = T) names(qs) <- c("ymin", "y", "ymax") qs } magic <- function(y, ...) { high <- median(SOdata[[y]], na.rm=T)+1.5*sd(SOdata[[y]],na.rm=T) low <- median(SOdata[[y]], na.rm=T)-1.5*sd(SOdata[[y]],na.rm=T) ggplot(SOdata, aes_string(x="six", y=y))+ stat_summary(fun.data="iqr", geom="crossbar", fill="grey", alpha=0.3)+ geom_point(data = SOdata[SOdata[[y]] > high,], position=position_jitter(w=0.1, h=0),col="green", alpha=0.5)+ geom_point(data = SOdata[SOdata[[y]] < low,], position=position_jitter(w=0.1, h=0),col="red", alpha=0.5)+ stat_summary(fun.y=median, geom="point",shape=18 ,size=4, col="orange") } for (i in names(SOdata)[-c(1,7)]) { p<- magic(i) ggsave(paste("magig_plot_",i,".png",sep=""), plot=p, height=3.5, width=5.5) } The problem is that sometimes in the call to geom_point the subset returns an empty dataframe, which sometimes (!) 
causes ggplot2 to plot all the data instead of none of the data:

    geom_point(data = SOdata[SOdata[[y]] > high,], position=position_jitter(w=0.1, h=0),col="green", alpha=0.5)+

This is kind of important to me, and I am really stuck trying to find a solution. Any help that will get me started is much appreciated. Thanks in advance.


How to put a newline into a column header in an xtable in R

- by PaulHurleyuk

I have a dataframe that I am putting into a Sweave document using xtable. However, one of my column names is quite long, and I would like to break it over two lines to save space.

    calqc_table<-structure(list(
        RUNID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
        ANALYTEINDEX = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
        ID = structure(1:11, .Label = c("Cal A", "Cal B", "Cal C", "Cal D", "Cal E", "Cal F", "Cal G", "Cal H", "Cal High", "Cal Low", "Cal Mid"), class = "factor"),
        mean_conc = c(200.619459644855, 158.264703128903, 102.469121407733, 50.3551544728544, 9.88296440865076, 4.41727762501703, 2.53494715706024, 1.00602831741361, 199.065054555735, 2.48063347296935, 50.1499780776199),
        sd_conc = c(2.3275711264554, NA, NA, NA, NA, NA, NA, 0.101636943231162, 0, 0, 0),
        nrow = c(3, 1, 1, 1, 1, 1, 1, 3, 2, 2, 2)),
        .Names = c("Identifier of the Run within the Study", "ANALYTEINDEX", "ID", "mean_conc", "sd_conc", "nrow"),
        row.names = c(NA, -11L ), class = "data.frame")
    calqc_xtable<-xtable(calqc_table)

I have tried putting a newline into the name, but this didn't seem to work:

    names(calqc_table)[1]<-"Identifier of the \nRun within the Study"

Is there a way to do this? I have seen someone suggest using the latex function from the Hmisc package to iterate over the table and write it out in LaTeX manually, including the newline, but this seems like a bit of a faff!


Utilise Surv object in ggplot or lattice

- by Misha

Does anyone know how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do trellis/facet-like survival graphs. In the end I played around and sort of found a solution for a Kaplan-Meier plot. Apologies for the messy code in taking the list elements into a dataframe, but I couldn't figure out another way. Note: it only works with two levels of stratum. If anyone knows how I can use x<-length(stratum) to do this, please let me know (in Stata I could append to a macro; I am unsure how this works in R)...

    ggkm<-function(time,event,stratum) {
      m2s<-Surv(time,as.numeric(event))
      fit <- survfit(m2s ~ stratum)
      f$time<-fit$time
      f$surv<-fit$surv
      f$strata<-c(rep(names(fit$strata[1]),fit$strata[1]),rep(names(fit$strata[2]),fit$strata[2]))
      f$upper<-fit$upper
      f$lower<-fit$lower
      r<-ggplot (f,aes(x=time,y=surv,fill=strata,group=strata))+geom_line()+geom_ribbon(aes(ymin=lower,ymax=upper),alpha=0.3)
      return(r)
    }


Packet Crafting on a Mac

- by JayCrossler

I think (based on searching the forums) that NetCat is the best option, but I wanted to hear whether others are preferable. Has anyone had good success with a packet-crafting tool (specifically on Mac)? I've tried HPing, but had some issues. I'm looking into NetCat ('nc' on Mac) now, but it's not working as I had thought.

Basically, I captured some packets that a remote control sends over a wifi network to turn lights on and off (using X10 controllers), and am looking for a way to replay them by crafting a TCP packet from the command line. I used Wireshark to sniff the traffic, so I know the packet is:

    DEVICE -sendplc-"C4 DIM 10"

I'm trying:

    echo 'DEVICE -sendplc-"C4 DIM 10"' | nc 192.168.2.196 6003

but there's no response from the receiving system. The exact bytestream is (if anyone wants to check that I got the right dataframe):

    08004642f1b400260897ad6308004500004b08e240004006aaf5c0a802c1c0a802c4d8d7177399aab39
    e57ff4753801880ae37ea00000101080a323353ce01b406424445564943457e73656e64706c637e2243
    34204f4e220a

The next step I'm going to take is to make sure the packet arrives exactly, by sniffing it and comparing it to the original.

[EDIT: Also at ServerFault now: Packet Crafting on a Mac, so this one becomes a duplicate across the two sites]


Convert object to DateRange

- by user655832

I'm querying an underlying PostgreSQL database using Pandas 0.8. Pandas is returning the DataFrame properly, but the underlying timestamp column in my database is being returned as a generic "object" type in Pandas. As I would eventually like to do seasonal normalization of my data, I am curious how to convert this generic "object" column to something that is appropriate for analysis. Here is my current code to retrieve the data:

    # get records from db example
    import pandas.io.sql as psql
    import psycopg2

    # define query to get all subs created this year
    QRY = """
    select
        i i,
        i * random() f,
        case when random() > 0.5 then true else false end t,
        (current_date - (i*random())::int)::timestamp with time zone tsz
    from generate_series(1,1000) as s(i)
    order by 4
    ;
    """
    CONN_STRING = "host='localhost' port=5432 dbname='postgres' user='postgres'"

    # connect to db
    conn = psycopg2.connect(CONN_STRING)

    # get some data set index on relid column
    df = psql.frame_query(QRY, con=conn)
    print "Row count retrieved: %i" % (len(df),)

Thanks for any help you can render. M
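A minimal sketch of one conversion route is shown below. It assumes a current pandas release rather than 0.8, and it stands in for the database result with a small hand-built frame whose `tsz` column holds timestamp strings; in the real case the column would hold whatever objects the database driver returned.

```python
import pandas as pd

# Hypothetical stand-in for the query result; only the column names i and tsz
# come from the question.
df = pd.DataFrame({
    "i": [3, 1, 2],
    "tsz": [
        "2012-07-03 00:00:00+00:00",
        "2012-07-01 00:00:00+00:00",
        "2012-07-02 00:00:00+00:00",
    ],
})

# Parse the generic column into timezone-aware timestamps (normalized to UTC).
df["tsz"] = pd.to_datetime(df["tsz"], utc=True)

# Use the timestamps as a sorted index so time-series tools (resampling,
# rolling windows, grouping by month) become available.
ts = df.set_index("tsz").sort_index()

print(ts.index)
print(ts["i"].tolist())
```

Once the index is a DatetimeIndex, seasonal summaries can be built with resampling or by grouping on attributes such as `ts.index.month`.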


How do I introduce row names to a function in R

- by Tahnoon Pasha

Hi, I have put together a utility function to insert rows into a dataframe (below). If I were writing out the formula by hand, I would put something like

    newframe=rbind(oldframe[1:rownum,],row_to_insert=row_to_insert,oldframe[(rownum+1):nrow(oldframe),])

to name row_to_insert. Could someone tell me how to do this in a function? Thanks

    insertrows=function (x, y, rownum) {
      newframe = rbind(y[1:rownum, ], x, y[(rownum + 1):nrow(y), ])
      return(data.frame(newframe))
    }

MWE of some underlying data added below:

    financials=data.frame(sales=c(100,150,200,250),some.direct.costs=c(25,30,35,40),other.direct.costs=c(15,25,25,35),indirect.costs=c(40,45,45,50))
    oldframe=t(financials)
    colnames(oldframe)=make.names(seq(2000,2003,1))
    total.direct.costs=oldframe['some.direct.costs',]+oldframe['other.direct.costs',]
    newframe=total.direct.costs
    n=rownum=3
    oldframe=insertrows(total.direct.costs=newframe,oldframe,n)


R: Using sapply on vector of POSIXct

- by Chris

I have what may be a very simple question. I want to process a column of POSIXct objects from a dataframe and generate a vector of datetime strings. I tried to use the following sapply call:

    dt <- sapply(df$datetime, function(x) format(x,"%Y-%m-%dT%H:%M:%S"))

but to no avail. I keep getting the following error:

    Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : invalid 'trim' argument

When I apply this function to a single POSIXct object from the column, I have no problem. So I'm stumped at the moment about what the problem is. Do I need to do something special with POSIXct objects?


Matlab: Randomize and Split

- by Null-Hypothesis

I have the following data matrix in MATLAB. I am trying to split it into multiple segments by passing a variable to a MATLAB function, but before splitting I would like to shuffle the matrix. The size of my matrix is 150x4.

    s.data
    5.1000 3.5000 1.4000 0.2000
    4.9000 3.0000 1.4000 0.2000
    4.7000 3.2000 1.3000 0.2000
    4.6000 3.1000 1.5000 0.2000
    5.0000 3.6000 1.4000 0.2000
    ..

    s =
        data: [150x4 double]
        labels: [150x1 double]

Coming from an R environment, I find MATLAB very strange. Initially I thought the columns in a matrix were related to each other like in an R dataframe, but that assumption was wrong.


Create categorical variable in R based on range

- by Stedy

I have a dataframe with a column of integers that I would like to use as a reference to make a new categorical variable. I want to divide the variable into three groups and set the ranges myself (i.e., 0-5, 6-10, etc.). I tried cut, but that divides the variable into groups based on a normal distribution, and my data is right skewed. I have also tried to use if/then statements, but this outputs a true/false value and I would like to keep my original variable. I am sure that there is a simple way to do this, but I cannot seem to figure it out. Any advice on a simple way to do this quickly?


Select only the first rows for each unique value of a column in R

- by dmvianna

From a dataframe like this:

    test <- data.frame('id'= rep(1:5,2), 'string'= LETTERS[1:10])
    test <- test[order(test$id), ]
    rownames(test) <- 1:10

    > test
       id string
    1   1      A
    2   1      F
    3   2      B
    4   2      G
    5   3      C
    6   3      H
    7   4      D
    8   4      I
    9   5      E
    10  5      J

I want to create a new one with the first appearance of each id / string pair. If sqldf accepted R code within it, the query could look like this:

    res <- sqldf("select id, min(rownames(test)), string from test group by id, string")

    > res
      id string
    1  1      A
    3  2      B
    5  3      C
    7  4      D
    9  5      E

Is there a solution short of creating a new column like

    test$row <- rownames(test)

and running the same sqldf query with min(row)?


Writing a script for reading many .csv files with similar filenames

- by wahalulu

I have several .csv files with similar filenames except a numeric month (i.e. 03_data.csv, 04_data.csv, 05_data.csv, etc.) that I'd like to read into R. I have two questions: Is there a function in R similar to MATLAB's varname and assignin that will let me create/declare a variable name within a function or loop that will allow me to read the respective .csv file - i.e. 03_data.csv into 03_data data.frame, etc.? I want to write a quick loop to do this because the filenames are similar. As an alternative, is it better to create one dataframe with the first file and then append the rest using a for loop? How would I do that?


In R draw two lines, with slopes double and half the value of the best fit line

- by D W

I have data with a best fit line drawn. I need to draw two other lines: one needs to have double the slope, and the other needs to have half the slope. Later I will use the region to differentially color points outside it, as per: http://stackoverflow.com/questions/2687212/conditionally-colour-data-points-outside-of-confidence-bands-in-r

Example dataset:

    ## Dataset from http://www.apsnet.org/education/advancedplantpath/topics/RModules/doc1/04_Linear_regression.html
    ## Disease severity as a function of temperature

    # Response variable, disease severity
    diseasesev<-c(1.9,3.1,3.3,4.8,5.3,6.1,6.4,7.6,9.8,12.4)
    # Predictor variable, (Centigrade)
    temperature<-c(2,1,5,5,20,20,23,10,30,25)

    ## For convenience, the data may be formatted into a dataframe
    severity <- as.data.frame(cbind(diseasesev,temperature))

    ## Fit a linear model for the data and summarize the output from function lm()
    severity.lm <- lm(diseasesev~temperature,data=severity)

    # Take a look at the data
    plot(
      diseasesev~temperature,
      data=severity,
      xlab="Temperature",
      ylab="% Disease Severity",
      pch=16,
      pty="s",
      xlim=c(0,30),
      ylim=c(0,30)
    )
    title(main="Graph of % Disease Severity vs Temperature")
    par(new=TRUE) # don't start a new plot
    abline(severity.lm, col="blue")


compare two characters based on subset

- by schultem

I have a simple dataframe with three columns:

    df <- data.frame(x = c(1,1,2,2,3), y = c(rep(1:2,2),1), target = c('a','a','a','b','a'))

I would like to compare the strings in the target column (find out whether they are equal or not, i.e., TRUE or FALSE) within every level of x (same number for x). First I would like to compare lines 1 and 2, then 3 and 4, and so on. My problem is that some comparisons are missing: for example, line 5 has only one case instead of two, so it should turn out to be FALSE. Variable y indicates the first and second case within x. I played around with ddply, doing something like:

    ddply(df, .(x), summarise, ifelse(as.character(df[df$y == '1',]$target), as.character(df[df$y == '2',]$target),0,1))

which is ugly and does not work. Any insights into how I could achieve this comparison? Thanks


improve my code for collapsing a list of data.frames

- by romunov

Dear StackOverFlowers (flowers in short), I have a list of data.frames (walk.sample) that I would like to collapse into a single (giant) data.frame. While collapsing, I would like to mark (by adding another column) which rows came from which element of the list. This is what I've got so far. This is the list that needs to be collapsed/stacked:

    > walk.sample
    [[1]]
         walker        x         y
    1073      3 228.8756 -726.9198
    1086      3 226.7393 -722.5561
    1081      3 219.8005 -728.3990
    1089      3 225.2239 -727.7422
    1032      3 233.1753 -731.5526

    [[2]]
         walker        x         y
    1008      3 205.9104 -775.7488
    1022      3 208.3638 -723.8616
    1072      3 233.8807 -718.0974
    1064      3 217.0028 -689.7917
    1026      3 234.1824 -723.7423

    [[3]]
    [1] 3

    [[4]]
         walker        x         y
    546       2 629.9041  831.0852
    524       2 627.8698  873.3774
    578       2 572.3312  838.7587
    513       2 633.0598  871.7559
    538       2 636.3088  836.6325
    1079      3 206.3683 -729.6257
    1095      3 239.9884 -748.2637
    1005      3 197.2960 -780.4704
    1045      3 245.1900 -694.3566
    1026      3 234.1824 -723.7423

I have written a function that adds a column denoting which element the rows came from, and then appends the result to an existing data.frame.

    collapseToDataFrame <- function(x) {
      # collapse list to a dataframe with a twist
      walk.df <- data.frame()
      for (i in 1:length(x)) {
        n.rows <- nrow(x[[i]])
        if (length(x[[i]])>1) {
          temp.df <- cbind(x[[i]], rep(i, n.rows))
          names(temp.df) <- c("walker", "x", "y", "session")
          walk.df <- rbind(walk.df, temp.df)
        } else {
          cat("Empty list", "\n")
        }
      }
      return(walk.df)
    }

    > collapseToDataFrame(walk.sample)
    Empty list
    Empty list
         walker         x          y session
    3         1 -604.5055 -123.18759       1
    60        1 -562.0078  -61.24912       1
    84        1 -594.4661  -57.20730       1
    9         1 -604.2893 -110.09168       1
    43        1 -632.2491  -54.52548       1
    1028      3  240.3905 -724.67284       1
    1040      3  232.5545 -681.61225       1
    1073      3  228.8756 -726.91980       1
    1091      3  209.0373 -740.96173       1
    1036      3  248.7123 -694.47380       1

I'm curious whether this can be done more elegantly, perhaps with do.call() or some other more generic function?


How to include multiple tables programmatically into a Sweave document using R

- by PaulHurleyuk

Hello, I want to have a Sweave document that will include a variable number of tables. I thought the example below would work, but it doesn't. I want to loop over the list foo and print each element as its own table.

    %
    \documentclass[a4paper]{article}
    \usepackage[OT1]{fontenc}
    \usepackage{longtable}
    \usepackage{geometry}
    \usepackage{Sweave}
    \geometry{left=1.25in, right=1.25in, top=1in, bottom=1in}
    \listfiles
    \begin{document}

    <<label=start, echo=FALSE, include=FALSE>>=
    startt<-proc.time()[3]
    library(RODBC)
    library(psych)
    library(xtable)
    library(plyr)
    library(ggplot2)
    options(width=80)

    #Produce some example data, here I'm creating some dummy dataframes and putting them in a list
    foo<-list()
    foo[[1]]<-data.frame(GRP=c(rep("AA",10), rep("Aa",10), rep("aa",10)), X1=rnorm(30), X2=rnorm(30,5,2))
    foo[[2]]<-data.frame(GRP=c(rep("BB",10), rep("bB",10), rep("BB",10)), X1=rnorm(30), X2=rnorm(30,5,2))
    foo[[3]]<-data.frame(GRP=c(rep("CC",12), rep("cc",18)), X1=rnorm(30), X2=rnorm(30,5,2))
    foo[[4]]<-data.frame(GRP=c(rep("DD",10), rep("Dd",10), rep("dd",10)), X1=rnorm(30), X2=rnorm(30,5,2))
    @

    \title{Document to test putting a variable number of tables into a sweave Document}
    \author{"Paul Hurley"}
    \maketitle

    \section{Text}

    This document was created on \today, with \Sexpr{print(version$version.string)} running on a \Sexpr{print(version$platform)} platform. It took approx \input{time} sec to process.

    <<label=test, echo=FALSE, results=tex>>=
    cat("Foo")
    @

    that was a test, so is this

    <<label=table1test, echo=FALSE, results=tex>>=
    print(xtable(foo[[1]]))
    @

    \newpage
    \subsection{Tables}

    <<label=Tables, echo=FALSE, results=tex>>=
    for(i in seq(foo)){
      cat("\n")
      cat(paste("Table_",i,sep=""))
      cat("\n")
      print(xtable(foo[[i]]))
      cat("\n")
    }
    #cat("<<label=endofTables>>= ")
    @

    <<label=bye, include=FALSE, echo=FALSE>>=
    endt<-proc.time()[3]
    elapsedtime<-as.numeric(endt-startt)
    @

    <<label=elapsed, include=FALSE, echo=FALSE>>=
    fileConn<-file("time.tex", "wt")
    writeLines(as.character(elapsedtime), fileConn)
    close(fileConn)
    @

    \end{document}

Here, the table1test chunk works as expected and produces a table based on the dataframe in foo[[1]]; however, the loop only produces Table(underscore)1.... Any ideas what I'm doing wrong?


How to create a column containing a string of stars to indicate levels of a factor in a data frame i

- by PaulHurleyuk

(second question today - must be a bad day) I have a dataframe with various columns, including a concentration column (numeric), a flag highlighting invalid results (boolean) and a description of the problem (character):

    dput(df)
    structure(list(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
        rawconc = c(77.4, 52.6, 86.5, 44.5, 167, 16.2, 59.3, 123, 1.95, 181),
        reason = structure(c(NA, NA, 2L, NA, NA, NA, 2L, 1L, NA, NA), .Label = c("Fails Acceptance Criteria", "Poor Injection"), class = "factor"),
        flag = c("False", "False", "True", "False", "False", "False", "True", "True", "False", "False" )),
        .Names = c("x", "rawconc", "reason", "flag"), row.names = c(NA, -10L), class = "data.frame")

I can create a column with the numeric level of the reason column:

    df$level<-as.numeric(df$reason)
    df
        x rawconc                    reason  flag level
    1   1   77.40                      <NA> False    NA
    2   2   52.60                      <NA> False    NA
    3   3   86.50            Poor Injection  True     2
    4   4   44.50                      <NA> False    NA
    5   5  167.00                      <NA> False    NA
    6   6   16.20                      <NA> False    NA
    7   7   59.30            Poor Injection  True     2
    8   8  123.00 Fails Acceptance Criteria  True     1
    9   9    1.95                      <NA> False    NA
    10 10  181.00                      <NA> False    NA

and here's what I want to do to create a column with 'level' many stars, but it fails:

    df$stars<-paste(rep("*",df$level)sep="",collapse="")
    Error: unexpected symbol in "df$stars<-paste(rep("*",df$level)sep"

    df$stars<-paste(rep("*",df$level),sep="",collapse="")
    Error in rep("*", df$level) : invalid 'times' argument

    rep("*",df$level)
    Error in rep("*", df$level) : invalid 'times' argument

    df$stars<-paste(rep("*",pmax(df$level,0,na.rm=TRUE)),sep="",collapse="")
    Error in rep("*", pmax(df$level, 0, na.rm = TRUE)) : invalid 'times' argument

It seems that rep needs to be fed one value at a time. I feel that this should be possible (and my gut says 'use lapply', but my apply-fu is very poor). Anyone want to try?


subtotals in columns using reshape2 in R

- by user1043144

I have spent some time now learning reshape2 and plyr, but I still do not get it. This time I have a problem with (a) subtotals and (b) passing different aggregate functions. Here is an example using data from the excellent tutorial on the blog of mrdwab, http://news.mrdwab.com/

    # libraries
    library(plyr)
    library(reshape2)

    # get data and add few more variables
    book.sales = read.csv("http://news.mrdwab.com/data-booksales")
    book.sales$Stock = book.sales$Quantity + 10
    book.sales$SubjCat[(book.sales$Subject == 'Economics') | (book.sales$Subject == 'Management') ] <- '1_EconSciences'
    book.sales$SubjCat[book.sales$Subject %in% c('Anthropology', 'Politics', 'Sociology') ] <- '2_SocSciences'
    book.sales$SubjCat[book.sales$Subject %in% c('Communication', 'Fiction', 'History', 'Research', 'Statistics') ] <- '3_other'

    # to get to my starting dataframe (close to the project I am working on)
    book.sales1 <- ddply(book.sales, c('Region', 'Representative', 'SubjCat', 'Subject', 'Publisher'), summarize,
                         Stock = sum(Stock),
                         Sold = sum(Quantity),
                         Ratio = round((100 * sum(Quantity)/ sum(Stock)), digits = 1))

    #melt it
    m.book.sales = melt(data = book.sales1,
                        id.vars = c('Region', 'Representative', 'SubjCat', 'Subject', 'Publisher'),
                        measured.vars = c('Stock', 'Sold', 'Ratio'))

    # cast it
    Tab1 <- dcast(data = m.book.sales,
                  formula = Region + Representative ~ Publisher + variable,
                  fun.aggregate = sum,
                  margins = c('Region', 'Representative'))

Now my questions:

I have been able to add the subtotals in rows. But is it also possible to add margins in the columns, for example the total sold across all publishers?

There is a problem with the "Ratio" columns. How can I get "mean" instead of "sum" for this variable?

P.S.: I have seen some examples using reshape. Would you recommend using it instead of reshape2 (which seems not to include the functionality of two functions)?


Trouble with applying a nested loop on a list

- by user1665355

I have a list consisting of 3 elements:

    datalist=list(a=datanew1,b=datanew2,c=datanew3)

    datalist$a :
           Inv_ret Firm size  leverage        Risk  Liquidity Equity
    17  0.04555968  17.34834 0.1323199 0.011292273 0.02471489      0
    48  0.01405835  15.86315 0.6931730 0.002491093 0.12054914      0
    109 0.04556252  16.91602 0.1714068 0.006235836 0.01194579      0
    159 0.04753472  14.77039 0.3885720 0.007126830 0.06373028      0
    301 0.03941040  16.94377 0.1805346 0.005450653 0.01723319      0

    datalist$b :
           Inv_ret Firm size   leverage        Risk  Liquidity      Equity
    31  0.04020832  18.13300 0.09326265 0.015235240 0.01579559 0.005025379
    62  0.04439078  17.84086 0.11016402 0.005486982 0.01266566 0.006559096
    123 0.04543250  18.00517 0.12215307 0.011154742 0.01531451 0.002282790
    173 0.03960613  16.45457 0.10828643 0.011506857 0.02385191 0.009003780
    180 0.03139643  17.57671 0.40063094 0.003447233 0.04530395 0.000000000

    datalist$c :
           Inv_ret Firm size   leverage       Risk   Liquidity      Equity
    92  0.03081029  19.25359 0.10513159 0.01635201 0.025760806 0.000119744
    153 0.03280746  19.90229 0.11731517 0.01443786 0.006769735 0.011999005
    210 0.04655847  20.12543 0.11622403 0.01418010 0.003125632 0.003802365
    250 0.03301018  20.67197 0.13208234 0.01262499 0.009418828 0.021400052
    282 0.04355975  20.03012 0.08588316 0.01918129 0.004213846 0.023657440

I am trying to run cor.test on the datalist above:

    Cor.tests=sapply(datalist,function(x){
      for(h in 1:length(names(x))){
        for(i in 1:length(names(x$h[i]))){
          for(j in 1:length(names(x$h[j]))){
            cor.test(x$h[,i],x$h[,j])$p.value
    }}}})

But I get an error:

    Error in cor.test.default(x$h[, i], x$h[, j]) : 'x' must be a numeric vector

Any suggestions about what I am doing wrong?

P.S. If I simply have one dataframe, datanew1 :

           Inv_ret Firm size  leverage        Risk  Liquidity Equity
    17  0.04555968  17.34834 0.1323199 0.011292273 0.02471489      0
    48  0.01405835  15.86315 0.6931730 0.002491093 0.12054914      0
    109 0.04556252  16.91602 0.1714068 0.006235836 0.01194579      0
    159 0.04753472  14.77039 0.3885720 0.007126830 0.06373028      0
    301 0.03941040  16.94377 0.1805346 0.005450653 0.01723319      0

I use this loop:

    results=matrix(NA,nrow=6,ncol=6)
    for(i in 1:length(names(datanew1))){
      for(j in 1:length(names(datanew1))){
        results[i,j]<-cor.test(datanew1[,i],datanew1[,j])$p.value
    }}

And the output is:

    results :
                 [,1]         [,2]         [,3]         [,4]         [,5]        [,6]
    [1,] 0.000000e+00 7.085663e-09 3.128975e-10 3.018239e-02 4.806400e-10 0.475139526
    [2,] 7.085663e-09 0.000000e+00 2.141581e-21 0.000000e+00 2.247825e-20 0.454032499
    [3,] 3.128975e-10 2.141581e-21 0.000000e+00 2.485924e-25 2.220446e-16 0.108643838
    [4,] 3.018239e-02 0.000000e+00 2.485924e-25 0.000000e+00 5.870007e-15 0.006783324
    [5,] 4.806400e-10 2.247825e-20 2.220446e-16 5.870007e-15 0.000000e+00 0.558827862
    [6,] 4.751395e-01 4.540325e-01 1.086438e-01 6.783324e-03 5.588279e-01 0.000000000

Which is exactly what I want. But I want to get 3 matrices, one for each element of the datalist above.

Search Results

Search found 81 results on 4 pages for 'dataframe'.
