Search Results

Search found 81 results on 4 pages for 'dataframe'.

Page 2/4 | < Previous Page | 1 2 3 4 | Next Page >

R counting the occurrences of similar rows of data frame

- by Matt

I have data in the following format called DF (this is just a made up simplified sample): eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0 random 1 1 1500 1500 100 120 40 232342 2 2 1000 1250 100 120 40 11843 3 3 1250 1250 100 120 40 981340234 4 4 1000 1187.5 100 120 40 4363453 5 1 2000 2000 200 100 40 345902 6 1 3000 3000 150 90 10 943 7 1 2000 2000 90 90 100 9304358 8 2 1800 1900 90 90 100 284333 However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values for (green.h.0, green.v.0, and offset.0) by only looking at the previous rows. The example above uses the expected values, but assume they are incorrect. How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables? I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I supposed I could just write a loop around that, but it seems inefficient to me.

Read the article
R selecting duplicate rows

- by Matt

Okay, I'm fairly new to R and I've tried to search the documentation for what I need to do but here is the problem. I have a data.frame called heeds.data in the following form (some columns omitted for simplicity) eval.num, eval.count, ... fitness, fitness.mean, green.h.0, green.v.0, offset.0, green.h.1, green.v.1,...green.h.7, green.v.7, offset.7... And I have selected a row meeting the following criteria: best.fitness <- min(heeds.data$fitness.mean[heeds.data$eval.count = 10]) best.row <- heeds.data[heeds.data$fitness.mean == best.fitness] Now, what I want are all of the other rows with that have columns green.h.0 to offset.7 (a contiguous section of columns) equal to the best.row Basically I'm looking for rows that have some of the conditions the same as the "best" row. I thought I could just do this, heeds.best <- heeds.data$fitness[ heeds.data$green.h.0 == best.row$green.h.0 & ... ] But with 24 columns it seems like a stupid method. Looking for something a bit simpler with less manual typing. Thanks!

Read the article
stored values within a custom function

- by romunov

My program takes a data.frame and crunches the numbers. At one point, values from j-th column are multiplied by a predefined values that depends on the column name (species name, actually - it's en ecological index). So far, I've been providing these values via a second data.frame by matching column names. What would be an efficient way of integrating fixed variable values within a function? I would like my program to be as portable as possible, without the need for a second data.frame file.

Read the article
R data frame select by global variable

- by Matt

I'm not sure how to do this without getting an error. Here is a simplified example of my problem. Say I have this data frame DF a b c d 1 2 3 4 2 3 4 5 3 4 5 6 Then I have a variable x <- min(c(1,2,3)) Now I want do do the following y <- DF[a == x] But when I try to refer to some variable like "x" I get an error because R is looking for a column "x" in my data frame. I get the "undefined columns selected" error How can I do what I am trying to do in R?

Read the article
R: Converting a list of data frames into one data frame

- by JD Long

I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame. I got some pointers from an earlier question which was trying to do something similar but more complex. Here's an example of what I am starting with (this is grossly simplified for illustration): listOfDataFrames <- NULL for (i in 1:100) { listOfDataFrames[[i]] <- data.frame(a=sample(letters, 500, rep=T), b=rnorm(500), c=rnorm(500)) } I am currently using this: df <- do.call("rbind", listOfDataFrames) *EDIT* whoops. In my haste to implement what I had "learned" in a previous question I totally screwed up. Yes, the unlist() is just plain wrong. I'm editing that out of the question above.

Read the article
R: How to replace elements of a data.frame?

- by John

I'm trying to replace elements of a data.frame containing "#N/A" with "NULL", and I'm running into problems: foo <- data.frame("day"= c(1, 3, 5, 7), "od" = c(0.1, "#N/A", 0.4, 0.8)) indices_of_NAs <- which(foo == "#N/A") replace(foo, indices_of_NAs, "NULL") Error in [<-.data.frame(*tmp*, list, value = "NULL") : new columns would leave holes after existing columns I think that the problem is that my index is treating the data.frame as a vector, but that the replace function is treating it differently somehow, but I'm not sure what the issue is?

Read the article
How to sort a Pandas DataFrame according to multiple criteria?

- by user1715271

I have the following DataFrame containing song names, their peak chart positions and the number of weeks they spent at position no 1: Song Peak Weeks 76 Paperback Writer 1 16 117 Lady Madonna 1 9 118 Hey Jude 1 27 22 Can't Buy Me Love 1 17 29 A Hard Day's Night 1 14 48 Ticket To Ride 1 14 56 Help! 1 17 109 All You Need Is Love 1 16 173 The Ballad Of John And Yoko 1 13 85 Eleanor Rigby 1 14 87 Yellow Submarine 1 14 20 I Want To Hold Your Hand 1 24 45 I Feel Fine 1 15 60 Day Tripper 1 12 61 We Can Work It Out 1 12 10 She Loves You 1 36 155 Get Back 1 6 8 From Me To You 1 7 115 Hello Goodbye 1 7 2 Please Please Me 2 20 92 Strawberry Fields Forever 2 12 93 Penny Lane 2 13 107 Magical Mystery Tour 2 16 176 Let It Be 2 14 0 Love Me Do 4 26 157 Something 4 9 166 Come Together 4 10 58 Yesterday 8 21 135 Back In The U.S.S.R. 19 3 164 Here Comes The Sun 58 19 96 Sgt. Pepper's Lonely Hearts Club Band 63 12 105 With A Little Help From My Friends 63 7 I'd like to rank these songs in order of popularity, so I'd like to sort them according to the following criteria: songs that reached the highest position come first, but if there is a tie, the songs that remained in the charts for the longest come first. I can't seem to figure out how to do this in Pandas.

Read the article
How to paste text and variables into a logical expression in R?

- by Jasper

I want to paste variables in the logical expression that I am using to subset data, but the subset function does not see them as column names when pasted (either with ot without quotes). I have a dataframe with columns named col1, col2 etc. I want to subset for the rows in which colx < 0.05 This DOES work: subsetdata<-subset(dataframe, col1<0.05) subsetdata<-subset(dataframe, col2<0.05) This does NOT work: for (k in 1:2){ subsetdata<-subset(dataframe, paste("col",k,sep="")<0.05) } for (k in 1:2){ subsetdata<-subset(dataframe, noquote(paste("col",k,sep=""))<0.05) } I can't find the answer; any suggestions?

Read the article
Efficient alternative to merge() when building dataframe from json files with R?

- by Bryan

I have written the following code which works, but is painfully slow once I start executing it over thousands of records: require("RJSONIO") people_data <- data.frame(person_id=numeric(0)) json_data <- fromJSON(json_file) n_people <- length(json_data) for(lender in 1:n_people) { person_dataframe <- as.data.frame(t(unlist(json_data[[person]]))) people_data <- merge(people_data, person_dataframe, all=TRUE) } output_file <- paste("people_data",".csv") write.csv(people_data, file=output_file) I am attempting to build a unified data table from a series of json-formated files. The fromJSON() function reads in the data as lists of lists. Each element of the list is a person, which then contains a list of the attributes for that person. For example: [[1]] person_id name gender hair_color [[2]] person_id name location gender height [[...]] structure(list(person_id = "Amy123", name = "Amy", gender = "F", hair_color = "brown"), .Names = c("person_id", "name", "gender", "hair_color")) structure(list(person_id = "matt53", name = "Matt", location = structure(c(47231, "IN"), .Names = c("zip_code", "state")), gender = "M", height = 172), .Names = c("person_id", "name", "location", "gender", "height")) The end result of the code above is matrix where the columns are every person-attribute that appears in the structure above, and the rows are the relevant values for each person. As you can see though, some data is missing for some of the people, so I need to ensure those show up as NA and make sure things end up in the right columns. Further, location itself is a vector with two components: state and zip_code, meaning it needs to be flattened to location.state and location.zip_code before it can be merged with another person record; this is what I use unlist() for. I then keep the running master table in people_data. The above code works, but do you know of a more efficient way to accomplish what I'm trying to do? It appears the merge() is slowing this to a crawl... I have hundreds of files with hundreds of people in each file. Thanks! Bryan

Read the article
[R] how do I quickly group the time column in a dataframe into intervals?

- by Derek

Hi, Assuming in R, I have a data.frame with the first column representing the time (as POSIXct). The rest of the columns (e.g., columns 2) are numeric data. I would like to group time into 3-minute intervals. Each interval will the the average of values that falls into that particular interval. Right now, I have a for-loop that iterates through the time column and generate the interval on the fly. I am wondering if there's a more elegant way to accomplish the same thing? Thanks in advance. Derek

Read the article
how to pass arguments into function within a function in r

- by jon

I am writing function that involve other function from base R with alot of arguments. For example (real function is much longer): myfunction <- function (dataframe, Colv = NA) { matrix <- as.matrix (dataframe) out <- heatmap(matrix, Colv = Colv) return(out) } data(mtcars) myfunction (mtcars, Colv = NA) The heatmap has many arguments that can be passed to: heatmap(x, Rowv=NULL, Colv=if(symm)"Rowv" else NULL, distfun = dist, hclustfun = hclust, reorderfun = function(d,w) reorder(d,w), add.expr, symm = FALSE, revC = identical(Colv, "Rowv"), scale=c("row", "column", "none"), na.rm = TRUE, margins = c(5, 5), ColSideColors, RowSideColors, cexRow = 0.2 + 1/log10(nr), cexCol = 0.2 + 1/log10(nc), labRow = NULL, labCol = NULL, main = NULL, xlab = NULL, ylab = NULL, keep.dendro = FALSE, verbose = getOption("verbose"), ...) I want to use these arguments without listing them inside myfun. myfunction (mtcars, Colv = NA, col = topo.colors(16)) Error in myfunction(mtcars, Colv = NA, col = topo.colors(16)) : unused argument(s) (col = topo.colors(16)) I tried the following but do not work: myfunction <- function (dataframe, Colv = NA) { matrix <- as.matrix (dataframe) out <- heatmap(matrix, Colv = Colv, ....) return(out) } data(mtcars) myfunction (mtcars, Colv = NA, col = topo.colors(16))

Read the article
How split a column in two colunms in pandas

- by user1345283

I have el next dataframe data=read_csv('enero.csv') data Fecha DirViento MagViento 0 2011/07/01 00:00 318 6.6 1 2011/07/01 00:15 342 5.5 2 2011/07/01 00:30 329 6.6 3 2011/07/01 00:45 279 7.5 4 2011/07/01 01:00 318 6.0 5 2011/07/01 01:15 329 7.1 6 2011/07/01 01:30 300 4.7 7 2011/07/01 01:45 291 3.1 How to split the column Fecha in two columns,for example, get a dataframe as follows: Fecha Hora DirViento MagViento 0 2011/07/01 00:00 318 6.6 1 2011/07/01 00:15 342 5.5 2 2011/07/01 00:30 329 6.6 3 2011/07/01 00:45 279 7.5 4 2011/07/01 01:00 318 6.0 5 2011/07/01 01:15 329 7.1 6 2011/07/01 01:30 300 4.7 7 2011/07/01 01:45 291 3.1 I am using pandas for to read data I try to calculate daily averages from a monthly database has daily data recorded every 15 minutes. To do this, use pandas and grouped the columns: Date and Time for get a dataframe as follow: Fecha Hora 2011/07/01 00:00 -4.4 00:15 -1.7 00:30 -3.4 2011/07/02 00:00 -4.5 00:15 -4.2 00:30 -7.6 2011/07/03 00:00 -6.3 00:15 -13.7 00:30 -0.3 with this look, I get the following grouped.mean() Fecha DirRes 2011/07/01 -3 2011/07/02 -5 2011/07/03 -6

Read the article
How do you calculate expanding mean on time series using pandas?

- by mlo

How would you create a column(s) in the below pandas DataFrame where the new columns are the expanding mean/median of 'val' for each 'Mod_ID_x'. Imagine this as if were time series data and 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2. I have tried every way I could think of but just can't seem to get it right. left4 = pd.DataFrame({'ID': [1,2,3,4],'val': [10000, 25000, 20000, 40000],'Mod_ID': [15, 35, 15, 42], 'car': ['ford','honda', 'ford', 'lexus']}) right4 = pd.DataFrame({'ID': [3,1,2,4],'color': ['red', 'green', 'blue', 'grey'], 'wheel': ['4wheel','4wheel', '2wheel', '2wheel'], 'Mod_ID': [15, 15, 35, 42]}) df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)

Read the article
How to keep columns labels when numeric convert to character

- by stata

a<- data.frame(sex=c(1,1,2,2,1,1),bq=factor(c(1,2,1,2,2,2))) library(Hmisc) label(a$sex)<-"gender" label(a$bq)<-"xxx" str(a) b<-data.frame(lapply(a, as.character), stringsAsFactors=FALSE) str(b) When I covert dataframe a columns to character,the columns labels disappeared.My dataframe have many columns.Here as an example only two columns. How to keep columns labels when numeric convert to character? Thank you!

Read the article
Existing function to slice pandas object by axis number

- by Zero

Pandas has the following indexers: Object Type Indexers Series s.loc[indexer] DataFrame df.loc[row_indexer,column_indexer] Panel p.loc[item_indexer,major_indexer,minor_indexer] I would like to be able to index dynamically by axis, for example: df = pd.DataFrame(data=0, index=['row1', 'row2', 'row3'], columns=['col1', 'col2', col3']) df.index(['row1', 'row3'], axis=0) # index by rows df.index(['col1', 'col2'], axis=1) # index by columns Is there a built-in function that does this?

Read the article
R equivalent of SELECT DISTINCT on two or more fields/variables

- by wahalulu

Say I have a dataframe df with two or more columns, is there an easy way to use unique() or other R function to create a subset of unique combinations of two or more columns? I know I can use sqldf() and write an easy "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this. It occurred to me to try ftable coerced to a dataframe and use the field names, but I also get the cross tabulations of combinations that don't exist in the dataset: uniques <- as.data.frame(ftable(df$var1, df$var2))

Read the article
how to determine if a character vector is a valid numeric or integer vector

- by Andrew Barr

I am trying to turn a nested list structure into a dataframe. The list looks similar to the following (it is serialized data from parsed JSON read in using the httr package). myList <- list(object1 = list(w=1, x=list(y=0.1, z="cat")), object2 = list(w=2, x=list(y=0.2, z="dog"))) unlist(myList) does a great job of recursively flattening the list, and I can then use lapply to flatten all the objects nicely. flatList <- lapply(myList, FUN= function(object) {return(as.data.frame(rbind(unlist(object))))}) And finally, I can button it up using plyr::rbind.fill myDF <- do.call(plyr::rbind.fill, flatList) str(myDF) #'data.frame': 2 obs. of 3 variables: #$ w : Factor w/ 2 levels "1","2": 1 2 #$ x.y: Factor w/ 2 levels "0.1","0.2": 1 2 #$ x.z: Factor w/ 2 levels "cat","dog": 1 2 The problem is that w and x.y are now being interpreted as character vectors, which by default get parsed as factors in the dataframe. I believe that unlist() is the culprit, but I can't figure out another way to recursively flatten the list structure. A workaround would be to post-process the dataframe, and assign data types then. What is the best way to determine if a vector is a valid numeric or integer vector?

Read the article
How to mutate rows of data frame - replacing one value with another

- by rhh

I'm having trouble with what I think is a basic R task. Here's my sample dataframe named 'b' Winner Color Size Tom Yellow Med Jerry Yellow Lar Jane Blue Med where items in the Winner column are factors. I'm trying to change "Tom" in the dataframe to "Tom LLC" and I can't get it done. Here's what I tried: Simple way: b$winner[b$winner=='Tom'] = as.factor('Tom LLC') but that failed with "invalid factor level, NAs generated" Next I tried a more advanced route: name_reset = function (x, y, z) { if (x$winner == y) {x$winner = z} } b = adply(b,1,name_reset,'Tom','Tom LLC') but that failed with "Error in list_to_dataframe(res, attr(.data, "split_labels")) : Results are not equal lengths" I feel I'm missing something basic. Can someone redirect me or offer suggestions on the code I wrote above? Thank you very much

Read the article
Replace multible data codes in a datafram with names

- by Shabana

I have a problem in replacing codes in a dataframe of 3890 observations. My dataframe has a character variable df$IJN which contains values from 1 to 27. I would like to replace these with meaningful data as follow If(1 OR 6 OR 10 OR 14 OR 18 OR 22 OR 26) should be replaced with UL. If(3 OR 7 OR 11 OR 15 OR 19 OR 23 OR 27) should be replaced with LL. If(4 OR 8 OR 12 OR 16 OR 20 OR 24) should be replaced with UR. If(5 OR 9 OR 13 OR 17 OR 21 OR 25) should be replaced with LR. (U,L,R,and L Refer to Upper, Lower, Right, and Left sites in the order) I thought of a for() with if() could not manage with it Also thought of df[which(df=="27")] ="LL" may work one by one not sure! Any help please. R v3.1 - Windows 7 E-H Shabana, Paris.

Read the article
How can I neatly clean my R workspace while preserving certain objects?

- by briandk

Suppose I'm messing about with some data by binding vectors together, as I'm wont to do on a lazy sunday afternoon. x <- rnorm(25, mean = 65, sd = 10) y <- rnorm(25, mean = 75, sd = 7) z <- 1:25 dd <- data.frame(mscore = x, vscore = y, caseid = z) I've now got my new dataframe dd, which is wonderful. But there's also still the detritus from my prior slicings and dicings: > ls() [1] "dd" "x" "y" "z" What's a simple way to clean up my workspace if I no longer need my "source" columns, but I want to keep the dataframe? That is, now that I'm done manipulating data I'd like to just have dd and none of the smaller variables that might inadvertently mask further analysis: > ls() [1] "dd" I feel like the solution must be of the form rm(ls[ -(dd) ]) or something, but I can't quite figure out how to say "please clean up everything BUT the following objects."

Read the article
Non standard interaction among two tables to avoid very large merge

- by riko

Suppose I have two tables A and B. Table A has a multi-level index (a, b) and one column (ts). b determines univocally ts. A = pd.DataFrame( [('a', 'x', 4), ('a', 'y', 6), ('a', 'z', 5), ('b', 'x', 4), ('b', 'z', 5), ('c', 'y', 6)], columns=['a', 'b', 'ts']).set_index(['a', 'b']) AA = A.reset_index() Table B is another one-column (ts) table with non-unique index (a). The ts's are sorted "inside" each group, i.e., B.ix[x] is sorted for each x. Moreover, there is always a value in B.ix[x] that is greater than or equal to the values in A. B = pd.DataFrame( dict(a=list('aaaaabbcccccc'), ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a') The semantics in this is that B contains observations of occurrences of an event of type indicated by the index. I would like to find from B the timestamp of the first occurrence of each event type after the timestamp specified in A for each value of b. In other words, I would like to get a table with the same shape of A, that instead of ts contains the "minimum value occurring after ts" as specified by table B. So, my goal would be: C: ('a', 'x') 4 ('a', 'y') 7 ('a', 'z') 5 ('b', 'x') 7 ('b', 'z') 7 ('c', 'y') 8 I have some working code, but is terribly slow. C = AA.apply(lambda row: ( row[0], row[1], B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))), axis=1).set_index(['a', 'b']) Profiling shows the culprit is obviously B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))). However, standard solutions using merge/join would take too much RAM in the long run. Consider that now I have 1000 a's, assume constant the average number of b's per a (probably 100-200), and consider that the number of observations per a is probably in the order of 300. In production I will have 1000 more a's. 1,000,000 x 200 x 300 = 60,000,000,000 rows may be a bit too much to keep in RAM, especially considering that the data I need is perfectly described by a C like the one I discussed above. How would I improve the performance?

Read the article
How to get the second sub element of every element in a list in R

- by PaulHurleyuk

I know I've come across this problem before, but I'm having a bit of a mental block at the moment. and as I can't find it on SO, I'll post it here so I can find it next time. I have a dataframe that contains a field representing an ID label. This label has two parts, an alpha prefix and a numeric suffix. I want to split it apart and create two new fields with these values in. structure(list(lab = c("N00", "N01", "N02", "B00", "B01", "B02", "Z21", "BA01", "NA03")), .Names = "lab", row.names = c(NA, -9L ), class = "data.frame") df$pre<-strsplit(df$lab, "[0-9]+") df$suf<-strsplit(df$lab, "[A-Z]+") Which gives lab pre suf 1 N00 N , 00 2 N01 N , 01 3 N02 N , 02 4 B00 B , 00 5 B01 B , 01 6 B02 B , 02 7 Z21 Z , 21 8 BA01 BA , 01 9 NA03 NA , 03 So, the first strsplit works fine, but the second gives a list, each having two elements, an empty string and the result I want, and stuffs them both into the dataframe column. How can I select the second sub-element from each element of the list ? (or, is there a better way to do this) Thanks Paul.

Read the article
convert variable with mixed date formats to one format in r

- by jalapic

A sample of my dataframe: date 1 25 February 1987 2 20 August 1974 3 9 October 1984 4 18 August 1992 5 19 September 1995 6 16-Oct-63 7 30-Sep-65 8 22 Jan 2008 9 13-11-1961 10 18 August 1987 11 15-Sep-70 12 5 October 1994 13 5 December 1984 14 03/23/87 15 30 August 1988 16 26-10-1993 17 22 August 1989 18 13-Sep-97 I have a large dataframe with a date variable that has multiple formats for dates. Most of the formats in the variable are shown above- there are a couple of very rare others too. The reason why there are multiple formats is that the data were pulled together from various websites that each used different formats. I have tried using straightforward conversions e.g. strftime(mydf$date,"%d/%m/%Y") but these sorts of conversion will not work if there are multiple formats. I don't want to resort to multiple gsub type editing. I was wondering if I am missing a more simple solution? Code for example: structure(list(date = structure(c(12L, 8L, 18L, 6L, 7L, 4L, 14L, 10L, 1L, 5L, 3L, 17L, 16L, 11L, 15L, 13L, 9L, 2L), .Label = c("13-11-1961", "13-Sep-97", "15-Sep-70", "16-Oct-63", "18 August 1987", "18 August 1992", "19 September 1995", "20 August 1974", "22 August 1989", "22 Jan 2008", "03/23/87", "25 February 1987", "26-10-1993", "30-Sep-65", "30 August 1988", "5 December 1984", "5 October 1994", "9 October 1984"), class = "factor")), .Names = "date", row.names = c(NA, -18L), class = "data.frame")

Read the article
Crosstab with multiple items

- by Michael Wexler

In SPSS, it is (relatively) easy to create a cross tab with multiple variables using the factors as the table heading. So, something like the following (made up data, etc.). Q1, Q2, and Q3 each have either a 1, a 2, or a 3 for each person. 1 (very Often) 2 (Rarely) 3 (Never) Q1. Likes it 12 15 13 Q2. Recommends it 22 11 10 Q3. Used it 22 12 9 In SPSS, one can even request row, column, or total percentages. I've tried table(), ftable(), xtab(), CrossTable() from gmodels, and CrossTable() from descr, and none of these can handle (afaik) multiple variables; they mostly seem to handle 1 variable crossed with another variable, and the 3rd creates layers. Is there a package with some good cross tabbing/table examples that I could use to figure this out? I'm sure I'm missing something simple, so I appreciate you pointing out what I missed. Perhaps I have to generate each row as a separate list and then make a dataframe and print the dataframe?

Read the article
How to sort a boxplot by the median values in pandas

- by Chris

I've got a dataframe outcome2 that I generate a grouped boxplot with in the following manner: In [11]: outcome2.boxplot(column='Hospital 30-Day Death (Mortality) Rates from Heart Attack',by='State') plt.ylabel('30 Day Death Rate') plt.title('30 Day Death Rate by State') Out [11]: What I'd like to do is sort the plot by the median for each state, instead of alphabetically. Not sure how to go about doing so.

Read the article

< Previous Page | 1 2 3 4 | Next Page >