Search Results

Search found 81 results on 4 pages for 'dataframe'.


Pandas Dataframe add rows on top of dataframe

- by yash.trojan.25

I am trying to add blank rows on top of the pandas Dataframe data. Basically, some blank rows and some calculation for each row which contains calculations for Average etc. for that column. Can someone please help me how I can do this?

From:

         A   B    D    E    F    G    H    I    J
     0  -8  10  532  533  533  532  534  532  532
     1  -8  12  520  521  523  523  521  521  521
     2  -8  14  520  523  522  523  522  521  522
     3  -4   2  526  527  527  528  528  527  529
     4  -4   4  516  518  517  519  518  516  518
     5  -4   6  528  529  530  531  530  528  530
     6  -4   8  518  521  521  521  522  519  521
     7  -4  10  524  525  525  525  525  524  524
     8  -4  12  522  523  524  525  525  522  523
     9  -2   2  525  526  527  527  527  525  527
    10  -2   4  518  519  519  521  520  519  520
    11  -2   6  520  522  522  522  522  520  523
    12  -2   8  551  551  552  552  552  550  552
    13  -2  10  533  534  535  536  535  534  535
    14  -2  12  537  539  539  539  538  537  539
    15  -2  14  528  530  530  531  530  529  530
    16  -1   2  518  519  519  521  520  518  520

To:

               A   B      D      E      F      G      H      I      J
    Average            525.6  527.1  527.4  528.0  527.6  526.0  527.4
    Sigma                8.6    8.3    8.5    8.1    8.3    8.3    8.4
    Minimum              516    518    517    519    518    516    518
    Maximum              551    551    552    552    552    550    552
    0         -8  10    532    533    533    532    534    532    532
    1         -8  12    520    521    523    523    521    521    521
    2         -8  14    520    523    522    523    522    521    522
    3         -4   2    526    527    527    528    528    527    529
    4         -4   4    516    518    517    519    518    516    518
    5         -4   6    528    529    530    531    530    528    530
    6         -4   8    518    521    521    521    522    519    521
    7         -4  10    524    525    525    525    525    524    524
    8         -4  12    522    523    524    525    525    522    523
    9         -2   2    525    526    527    527    527    525    527
    10        -2   4    518    519    519    521    520    519    520
    11        -2   6    520    522    522    522    522    520    523
    12        -2   8    551    551    552    552    552    550    552
    13        -2  10    533    534    535    536    535    534    535
    14        -2  12    537    539    539    539    538    537    539
    15        -2  14    528    530    530    531    530    529    530
    16        -1   2    518    519    519    521    520    518    520
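One way this might be approached, sketched on a cut-down version of the data: compute the summary rows as their own small frame and stack the original frame underneath with pd.concat. The shortened frame, the value_cols list and the choice of which columns count as measurements are assumptions made for illustration. Note that pandas' std uses the sample standard deviation (ddof=1) by default, which may or may not be the "Sigma" the asker has in mind.

```python
import pandas as pd

# Shortened stand-in for the asker's data (only four rows, two value columns).
df = pd.DataFrame({
    "A": [-8, -8, -4, -4],
    "B": [10, 12, 2, 4],
    "D": [532, 520, 526, 516],
    "E": [533, 521, 527, 518],
})

# Assumption: A and B are keys, so only the remaining columns get statistics.
value_cols = ["D", "E"]

# One row per statistic, one column per measurement column.
stats = df[value_cols].agg(["mean", "std", "min", "max"])
stats.index = ["Average", "Sigma", "Minimum", "Maximum"]

# Stack the summary rows on top, then restore the original column order.
# Columns missing from `stats` (A and B) are filled with NaN in those rows.
result = pd.concat([stats, df])[df.columns]
print(result)
```

The result has a mixed index (strings for the summary rows, integers for the data rows), which is fine for display or export but can be awkward for further label-based selection.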

R : remove columns from dataframe where ALL values are NA

- by Sophomore

Hello everybody! I'm having some trouble with my huge data frame and couldn't really resolve this question myself: The dataframe has some properties as columns and each row represents one data set. I've done some sanitizing of this dataframe (e.g. getting rid of datasets which are not to be included in the evaluation). (For whoever might be interested: beforehand I aggregate around 5000 single text files and put them in a tsv; some of the properties have a sequence number like "button.pressed.1" ... "button.pressed.n". Some of the excluded sets had really high numbers for n, and all sets left have much smaller numbers for n, but the property "button.pressed.50" is still there and all remaining sets have an NA in that column. Actually it's a different property, but the example should clarify my intention...) So the question is quite simple (for some sophisticated R pro): I need to get rid of columns where for ALL rows the value is NA. Could someone please help me out? (All I have managed so far is to get rid of columns where at least one NA exists, which dropped about half my columns)...

Converting a dataframe to a vector (by rows)

- by mortalitysequence

I have a dataframe with numeric entries like this one test <- data.frame(x=c(26,21,20),y=c(34,29,28)) How can I get the following vector? > 26,34,21,29,20,28

writing to a dataframe from a for-loop in R

- by CCID

I'm trying to write from a loop to a data frame in R, for example a loop like this for (i in 1:20) { print(c(i+i,i*i,i/1))} and to write each line of 3 values to a data frame with three columns, so that each iteration takes on a new row. I've tried using matrix, with ncol=3 and filled by rows, but only get the last item from the loop. Thanks.

Summarising grouped records in a dataframe in R (...again)

- by monch1962

Hello all, (I tried to ask this question earlier today, but later realised I over-simplified the question; the answers I received were correct, but I couldn't use them because of my over-simplification of the problem in the original question. Here's my 2nd attempt...) I have a data frame in R that looks like: "Timestamp", "Source", "Target", "Length", "Content" 0.1 , P1 , P2 , 5 , "ABCDE" 0.2 , P1 , P2 , 3 , "HIJ" 0.4 , P1 , P2 , 4 , "PQRS" 0.5 , P2 , P1 , 2 , "ZY" 0.9 , P2 , P1 , 4 , "SRQP" 1.1 , P1 , P2 , 1 , "B" 1.6 , P1 , P2 , 3 , "DEF" 2.0 , P2 , P1 , 3 , "IJK" ... and I want to convert this to: "StartTime", "EndTime", "Duration", "Source", "Target", "Length", "Content" 0.1 , 0.4 , 0.3 , P1 , P2 , 12 , "ABCDEHIJPQRS" 0.5 , 0.9 , 0.4 , P2 , P1 , 6 , "ZYSRQP" 1.1 , 1.6 , 0.5 , P1 , P2 , 4 , "BDEF" ... Trying to put this into English, I want to group consecutive records with the same 'Source' and 'Target' together, then print out a single record per group showing the StartTime, EndTime & Duration (=EndTime-StartTime) for that group, along with the sum of the Lengths for that group, and a concatenation of the Content (which will all be strings) in that group. The TimeOffset values will always increase throughout the data frame. I had a look at melt/recast and have a feeling that it could be used to solve the problem, but couldn't get my head around the documentation. I suspect it's possible to do this within R, but I really don't know where to start. In a pinch I could export the data frame out and do it in e.g. Python, but I'd prefer to stay within R if possible. Thanks in advance for any assistance you can provide
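Since the asker mentions Python as a fallback, here is a rough pandas sketch of the "group consecutive records" idea. It is not an R answer, and the variable names (keys, run_id, summary) are made up for illustration. The trick it relies on is to flag every row whose Source/Target pair differs from the previous row and take a cumulative sum of those flags, which gives each consecutive run its own id.

```python
import pandas as pd

df = pd.DataFrame({
    "Timestamp": [0.1, 0.2, 0.4, 0.5, 0.9, 1.1, 1.6, 2.0],
    "Source": ["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P2"],
    "Target": ["P2", "P2", "P2", "P1", "P1", "P2", "P2", "P1"],
    "Length": [5, 3, 4, 2, 4, 1, 3, 3],
    "Content": ["ABCDE", "HIJ", "PQRS", "ZY", "SRQP", "B", "DEF", "IJK"],
})

keys = df[["Source", "Target"]]
# A new run starts wherever Source or Target differs from the previous row.
run_id = (keys != keys.shift()).any(axis=1).cumsum()

summary = (
    df.groupby(run_id)
      .agg(
          StartTime=("Timestamp", "min"),
          EndTime=("Timestamp", "max"),
          Source=("Source", "first"),
          Target=("Target", "first"),
          Length=("Length", "sum"),
          Content=("Content", "".join),  # concatenate the strings in each run
      )
      .reset_index(drop=True)
)
summary["Duration"] = summary["EndTime"] - summary["StartTime"]
print(summary)
```

Because the timestamps are assumed to increase throughout, min and max of each run are the same as its first and last value.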

Summarising grouped records in a dataframe in R

- by monch1962

Hello all, I have a data frame in R that looks like this: > TimeOffset, Source, Length > 0 1 1500 > 0.1 1 1000 > 0.2 1 50 > 0.4 2 25 > 0.6 2 3 > 1.1 1 1500 > 1.4 1 18 > 1.6 2 2500 > 1.9 2 18 > 2.1 1 37 > ... and I want to convert it to > TimeOffset, Source, Length > 0.2 1 2550 > 0.6 2 28 > 1.4 1 1518 > 1.9 2 2518 > ... Trying to put this into English, I want to group consecutive records with the same 'Source' together, then printing out a single record per group showing the highest time offset in that group, the source, and the sum of the lengths in that group. The TimeOffset values will always increase. I suspect this is possible in R, but I really don't know where to start. In a pinch I could export the data frame out and do it in e.g. Python, but I'd prefer to stay within R if possible. Thanks in advance for any assistance you can provide

Get particular row as series from pandas dataframe

- by Pratyush

How do we get a particular filtered row as a series? Example dataframe:

    >>> df = pd.DataFrame({'date': [20130101, 20130101, 20130102], 'location': ['a', 'a', 'c']})
    >>> df
           date location
    0  20130101        a
    1  20130101        a
    2  20130102        c

I need to select the row where location is c as a series. I tried:

    row = df[df["location"] == "c"].head(1)  # gives a dataframe
    row = df.ix[df["location"] == "c"]       # also gives a dataframe with single row

In either case I can't get the row as a series.
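A minimal sketch of one option: filter as before, then take the first matching row by position with .iloc[0], which returns a Series rather than a one-row DataFrame. The names matches and row are illustrative only. Also worth noting: .ix was removed in pandas 1.0, so current versions need .loc or .iloc instead.

```python
import pandas as pd

df = pd.DataFrame({"date": [20130101, 20130101, 20130102],
                   "location": ["a", "a", "c"]})

# Boolean filtering always gives back a DataFrame, even for a single match.
matches = df[df["location"] == "c"]

# Positional selection of one row turns it into a Series whose index is the
# column names. This raises IndexError if nothing matched.
row = matches.iloc[0]
print(type(row))
print(row)
```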

dataframe of average temperatures with Years in rows and months in columns

- by Maxwell Mkondiwa

I have data with variables minimum and maximum temperatures, month and years(1951-2001). I want to get a dataframe of average temperatures for each month in each year. I want the dataframe to look like this: Year jan feb mar apr may june..... 1951 xx xx 1952 xx 1953 . .

pandas: complex filter on rows of DataFrame

- by duckworthd

I would like to filter rows by a function of each row, e.g.

    def f(row):
        return np.sin(row['velocity'])/np.prod(row['masses']) > 5

    df = pandas.DataFrame(...)
    filtered = df[apply_to_all_rows(df, f)]

Or for another more complex, contrived example,

    def g(row):
        if row['col1'].method1() == 1:
            val = row['col1'].method2() / row['col1'].method3(row['col3'], row['col4'])
        else:
            val = row['col2'].method5(row['col6'])
        return np.sin(val)

    df = pandas.DataFrame(...)
    filtered = df[apply_to_all_rows(df, g)]

How can I do so?
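The hypothetical apply_to_all_rows(df, f) in the question maps fairly directly onto DataFrame.apply with axis=1, which calls the function once per row and returns a boolean Series usable as a mask. The sketch below uses made-up data, a single mass column and a threshold of 4 purely for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data; the column names and threshold are assumptions.
df = pd.DataFrame({"velocity": [0.5, 1.5, 3.0],
                   "mass": [0.1, 0.2, 0.5]})

def f(row):
    # `row` is a Series holding one row, indexed by column name.
    return np.sin(row["velocity"]) / row["mass"] > 4

# axis=1 applies f row by row; the result is a boolean Series.
mask = df.apply(f, axis=1)
filtered = df[mask]

# When the condition can be written on whole columns, this is usually faster.
vectorized = df[np.sin(df["velocity"]) / df["mass"] > 4]
print(filtered)
```

Row-wise apply is convenient for arbitrary Python logic such as the second example in the question, but it runs at Python-loop speed, so the column-wise form is preferable whenever the logic allows it.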

R: Select subset of dataframe by non-unique ids

- by amarillion

Suppose I have a dataframe like this one: df <- data.frame (id = c("a", "b", "a", "c", "e", "d", "e"), n=1:7) and a vector with ids like this one: v <- c("a", "b") How can I select the rows of the dataframe that match the ids in v? I can't use the id column for rownames because they are not unique.

Filtering a dataframe in row names from a column value in R

- by user1755155

Basically I have a dataframe with two columns (target_id and fpkm). I want to keep only those row names in the first column that are not duplicated. For example, in the dataframe below there are two rows with almost the same name, comp267138_c0_seq1 and comp267138_c0_seq2, and of the two I want to keep only one, comp267138_c0_seq2, based on the higher value in column 2.

    target_id            fpkm
    comp247393_c0_seq1   3.197885
    comp257058_c0_seq4   1.624577
    comp242590_c0_seq1   1.750319
    comp77911_c0_seq1    1.293059
    comp241426_c0_seq1   1.626589
    comp288413_c0_seq1   14.828853
    comp294436_c0_seq1   11.555596
    comp63603_c0_seq1    1.982386
    comp267138_c0_seq1   8.594494
    comp267138_c0_seq2   11.134958
    comp321623_c0_seq1   6.934149

Pandas Dataframe to JSON File with Separate Records

- by Chris

I'm attempting to dump data from a Pandas Dataframe into a JSON file to import into MongoDB. The format I require in a file has JSON records on each line of the form: {<column 1>:<value>,<column 2>:<value>,...,<column N>:<value>} df.to_json(,orient='records') gets close to the result but all the records are dumped within a single JSON array. Any thoughts on an efficient way to get this result from a dataframe? UPDATE: The best solution I've come up with is the following: dlist = df.to_dict('records') dlist = [json.dumps(record)+"\n" for record in dlist] open('data.json','w').writelines(dlist)
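For what it's worth, newer pandas versions have a lines option on to_json that appears to cover this case: combined with orient='records' it writes one JSON object per line instead of a single array. A small sketch with made-up data follows. It builds the text in memory; passing a file path as the first argument would write the same content to disk.

```python
import json
import pandas as pd

# Illustrative frame; the column names are assumptions.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# orient="records" with lines=True emits newline-delimited JSON,
# i.e. one {column: value, ...} object per row.
text = df.to_json(orient="records", lines=True)
print(text)

# Each non-empty line parses as a standalone JSON document.
records = [json.loads(line) for line in text.splitlines() if line]
```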

How to retrieve view of MultiIndex DataFrame

- by Henry S. Harrison

This question was inspired by this question. I had the same problem, updating a MultiIndex DataFrame by selection. The drop_level=False solution in Pandas 0.13 will allow me to achieve the same result, but I am still wondering why I cannot get a view from the MultiIndex DataFrame. In other words, why does this not work?: >>> sat = d.xs('sat', level='day', copy=False) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2248, in xs raise ValueError('Cannot retrieve view (copy=False)') ValueError: Cannot retrieve view (copy=False) Of course it could be only because it is not implemented, but is there a reason? Is it somehow ambiguous or impossible to implement? Returning a view is more intuitive to me than returning a copy then later updating the original. I looked through the source and it seems this situation is checked explicitly to raise an error. Alternatively, is it possible to get the same sort of view from any of the other indexing methods? I've experimented but have not been successful. [edit] Some potential implementations are discussed here. I guess with the last question above I'm wondering what the current best solution is to index into arbitrary multiindex slices and cross-sections.

R get rid of rows with duplicate attribute

- by CatholicEvangelist

Hi there, I have a big dataframe with columns such as: ID, time, OS, IP. Each row of that dataframe corresponds to one entry. Within that dataframe, several entries (rows) exist for some IDs. I would like to get rid of those multiple rows (obviously the other attributes will differ for the same ID). Or put differently: I only want one single entry (row) for each ID. Could anyone be so kind and help me out on this one? When I use unique on the ID column, I only receive the levels (or each unique ID), but I want to keep the other attributes as well... I have tried to use apply(x,2,unique(data$ID)), but this does not work either... Any help is highly appreciated!

Memory efficient import many data files into pandas DataFrame in Python

- by richardh

I import a directory of |-delimited .dat files into a pandas DataFrame. The following code works, but I eventually run out of RAM with a MemoryError.

    import pandas as pd
    import glob

    temp = []
    dataDir = 'C:/users/richard/research/data/edgar/masterfiles'
    for dataFile in glob.glob(dataDir + '/master_*.dat'):
        print(dataFile)
        temp.append(pd.read_table(dataFile, delimiter='|', header=0))
    masterAll = pd.concat(temp)

Is there a more memory efficient approach? Or should I go whole hog to a database? (I will move to a database eventually, but I am baby stepping my move to pandas.) Thanks! FWIW, here is the head of an example .dat file:

    cik|cname|ftype|date|fileloc
    1000032|BINCH JAMES G|4|2011-03-08|edgar/data/1000032/0001181431-11-016512.txt
    1000045|NICHOLAS FINANCIAL INC|10-Q|2011-02-11|edgar/data/1000045/0001193125-11-031933.txt
    1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-11|edgar/data/1000045/0001193125-11-005531.txt
    1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-27|edgar/data/1000045/0001193125-11-015631.txt
    1000045|NICHOLAS FINANCIAL INC|SC 13G/A|2011-02-14|edgar/data/1000045/0000929638-11-00151.txt

I am getting an error when trying to use melt() on a dataframe containing Dates

- by Dan

I'd like to melt the dataframe so that in one column i have dates in a second i have username as the variable and finally the value. I'm getting this error: Error in as.Date.numeric(value) : 'origin' must be supplied and while I understand the error I'm not exactly sure how to get around it. A small sample of the data is: structure(list(created_at = structure(c(14007, 14008, 14009, 14010, 14011, 14012), class = "Date"), benjamin = c(16, 0, 0, 0, 0, 0), byron = c(0, 0, 0, 0, 0, 0), cameronc = c(0, 0, 0, 0, 0, 0), daniel = c(0, 0, 0, 0, 0, 0), djdiaz = c(0, 0, 0, 0, 0, 0), gene = c(16, 77, 64, 38, 72, 36), joel = c(0, 0, 0, 0, 0, 2), kerem = c(0, 0, 0, 0, 0, 0), sophia = c(0, 0, 0, 0, 0, 0), SuperMoonMan = c(0, 0, 0, 0, 0, 0)), .Names = c("created_at", "benjamin", "byron", "cameronc", "daniel", "djdiaz", "gene", "joel", "kerem", "sophia", "SuperMoonMan"), row.names = c(NA, 6L), class = c("cast_df", "data.frame")) Thanks for your help.

Replace values in a dataframe based on another factor which contains NA's in R

- by PaulHurleyuk

I have a dataframe which contains (among other things) a numeric column with a concentration, and a factor column with a status flag. This status flag contains NA's. Here's an example df<-structure(list(conc = c(101.769, 1.734, 62.944, 92.697, 25.091, 27.377, 24.343, 55.084, 0.335, 23.280), status = structure(c(NA, NA, NA, NA, NA, NA, 2L, NA, 1L, NA), .Label = c("<LLOQ", "NR"), class = "factor")), .Names = c("conc", "status"), row.names = c(NA, -10L), class = "data.frame") I want to replace the concentration column with a string for some values of the flag column, or with the concentration value formatted to a certain number of significant digits. When I try this ifelse(df$status=="NR","NR",df$conc) The NA's in the status flag don't trigger either the true or false condition (and return NA) - as the documentation suggests it will. I could loop over the rows and use IF then else on each one but this seems inefficient. Am I missing something ? I've tried as.character(df$status) as well which doesn't work. My mojo must be getting low....

Creating a dataframe in pandas by multiplying two series together

- by Aoife

Say I have two series in pandas, series A and series B. How do I create a dataframe in which all of those values are multiplied together, i.e. with series A down the left hand side and series B along the top. Basically the same concept as this, where series A would be the yellow on the left and series B the yellow along the top, and all the values in between would be filled in by multiplication: http://www.google.co.uk/imgres?imgurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables/times-table-12x12.gif&imgrefurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables.htm&h=533&w=720&sz=58&tbnid=9B8R_kpUloA4NM:&tbnh=90&tbnw=122&zoom=1&usg=__meqZT9kIAMJ5b8BenRzF0l-CUqY=&docid=j9BT8tUCNtg--M&sa=X&ei=bkBpUpOWOI2p0AWYnIHwBQ&ved=0CE0Q9QEwBg Thanks!
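What is described here is an outer product, so one possible sketch is to let NumPy compute it and wrap the result in a DataFrame labelled by the two series' indexes. The series values and labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative series: `a` runs down the left side, `b` along the top.
a = pd.Series([1, 2, 3], index=["r1", "r2", "r3"])
b = pd.Series([10, 20], index=["c1", "c2"])

# np.outer multiplies every element of `a` with every element of `b`.
table = pd.DataFrame(np.outer(a.to_numpy(), b.to_numpy()),
                     index=a.index, columns=b.index)
print(table)
```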

How can I collapse a dataframe by some variables, taking mean across others

- by Alex Holcombe

I need to summarize a data frame by some variables, ignoring the others. This is sometimes referred to as collapsing. E.g. if I have a dataframe like this: Widget Type Energy egg 1 20 egg 2 30 jap 3 50 jap 1 60 Then collapsing by Widget, with Energy the dependent variable, Energy~Widget, would yield Widget Energy egg 25 jap 55 In Excel the closest functionality might be "Pivot tables" and I've worked out how to do it in python (http://alexholcombe.wordpress.com/2009/01/26/summarizing-data-by-combinations-of-variables-with-python/), and here's an example with R using doBy library to do something very related (http://www.mail-archive.com/[email protected]/msg02643.html), but is there an easy way to do the above? And even better is there anything built into the ggplot2 library to create plots that collapse across some variables?
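Since the asker already has a Python route, here is how the same collapse might look in pandas, as a sketch rather than an R answer: group by the variable to keep and take the mean of the dependent variable, ignoring the remaining columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Widget": ["egg", "egg", "jap", "jap"],
    "Type": [1, 2, 3, 1],
    "Energy": [20, 30, 50, 60],
})

# Collapse across Type: one row per Widget with the mean Energy.
# as_index=False keeps Widget as an ordinary column in the result.
collapsed = df.groupby("Widget", as_index=False)["Energy"].mean()
print(collapsed)
```

Collapsing by several variables works the same way, by passing a list of column names to groupby.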

R: outlier cleaning for each column in a dataframe by using quantiles 0.05 and 0.95

- by Rainer

Hi, I am an R novice. I want to do some outlier cleaning and overall scaling from 0 to 1 before putting the sample into a random forest.

    g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)

If I do a simple scaling from 0 - 1 the result would be:

    > round((g - min(g))/abs(max(g) - min(g)),1)
    [1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0

So my idea is to replace the values of each column that are greater than the 0.95-quantile with the next value smaller than the 0.95-quantile - and the same for the 0.05-quantile. So the pre-scaled result would be:

    g<-c(**70**,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,**40**)

and scaled:

    > round((g - min(g))/abs(max(g) - min(g)),1)
    [1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0

I need this formula for a whole dataframe, so the functional implementation within R should be something like:

    > apply(c, 2, function(x) x[x<quantile(x, 0.95)]<-max(x[x, ... max without the quantile(x, 0.95))

Can anyone help? By the way: if there exists a function that does this job directly, please let me know. I already checked out cut and cut2. cut fails because of not-unique breaks; cut2 would work, but only gives back string values or the mean value, and I need a numeric vector from 0 - 1. For trial:

    a<-c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)
    b<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
    c<-cbind(a,b)
    c<-as.data.frame(c)

Regards and thanks for help, Rainer

Load XML to DataFrame in R

- by Rohit Kandhal

I am new to R programming and trying to load a simple XML in RStudio. I tried using XMLToDataFrame but got this error XML content does not seem to be XML: 'temp.xml' XML Schema <root> <row Id="1" UserId="1" Name="Rohit" Date="2009-06-29T10:28:58.013" /> <row Id="2" UserId="3" Name="Rohit" Date="2009-06-29T10:28:58.030" /> </root> Please provide me some direction on which function I should use here.

add several variables to dataframe, based on vector

- by Andreas

I am sure this is easy - but I can't figure it out right now. Basically: I have a long vector of variables: names <- c("first","second", "third") I have some data, and I now need to add the variables. I could do: data$first <- NA But since I have a long list, I would like an automated solution. This doesn't work: for (i in 1:length(names)) (paste("data$", names[i],sep="") <- NA) The reason I want this is that I need to vertically merge two dataframes, where one doesn't have all the variables it should have. Thanks in advance

How to replace&add the dataframe element by another dataframe in Python Pandas?

- by bigbug

Suppose I have two data frame 'df_a' & 'df_b' , both have the same index structure and columns, but some of the inside data elements are different: >>> df_a sales cogs STK_ID QT 000876 1 100 100 2 100 100 3 100 100 4 100 100 5 100 100 6 100 100 7 100 100 >>> df_b sales cogs STK_ID QT 000876 5 50 50 6 50 50 7 50 50 8 50 50 9 50 50 10 50 50 And now I want to replace the element of df_a by element of df_b which have the same (index, column) coordinate, and attach df_b's elements whose (index, column) coordinate beyond the scope of df_a . Just like add a patch 'df_b' to 'df_a' : >>> df_c = patch(df_a,df_b) sales cogs STK_ID QT 000876 1 100 100 2 100 100 3 100 100 4 100 100 5 50 50 6 50 50 7 50 50 8 50 50 9 50 50 10 50 50 How to write the 'patch(df_a,df_b)' function ?
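One candidate for the requested patch function is DataFrame.combine_first, which keeps the caller's non-null values and falls back to the other frame everywhere else, over the union of both indexes. The sketch below rebuilds the two frames from the question. The patch wrapper itself is hypothetical, and it assumes the patch frame contains no NaN values that are meant to overwrite real data.

```python
import pandas as pd

idx_a = pd.MultiIndex.from_product([["000876"], range(1, 8)],
                                   names=["STK_ID", "QT"])
idx_b = pd.MultiIndex.from_product([["000876"], range(5, 11)],
                                   names=["STK_ID", "QT"])
df_a = pd.DataFrame({"sales": 100, "cogs": 100}, index=idx_a)
df_b = pd.DataFrame({"sales": 50, "cogs": 50}, index=idx_b)

def patch(base, update):
    """Hypothetical helper: values in `update` win, `base` fills the rest."""
    # combine_first prefers the caller's non-null values, so it is called
    # on `update`; sort_index keeps the combined rows in index order.
    return update.combine_first(base).sort_index()

df_c = patch(df_a, df_b)
print(df_c)
```

Depending on the pandas version, the combined columns may come back as floats rather than integers, because missing values are introduced while the two indexes are aligned.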

break dataframe into subsets by factor values, send to function that returns glm class, how to recom

- by Alex Holcombe

Thanks to Hadley's plyr package ddply function we can take a dataframe, break it down into subdataframes by factors, send each to a function, and then combine the function results for each subdataframe into a new dataframe. But what if the function returns an object of a class like glm or in my case, a c("glm", "lm"). Then, these can't be combined into a dataframe can they? I get this error instead Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class 'c("glm", "lm")' into a data.frame Is there some more flexible data structure that will accommodate all the complex glm class results of my function calls, preserving the information regarding the dataframe subsets? Or should this be done in an entirely different way?

R counting the occurrence of similar rows of data frame

- by Matt

I have data in the following format called DF (this is just a made up simplified sample): eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0 random 1 1 1500 1500 100 120 40 232342 2 2 1000 1250 100 120 40 11843 3 3 1250 1250 100 120 40 981340234 4 4 1000 1187.5 100 120 40 4363453 5 1 2000 2000 200 100 40 345902 6 1 3000 3000 150 90 10 943 7 1 2000 2000 90 90 100 9304358 8 2 1800 1900 90 90 100 284333 However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values for (green.h.0, green.v.0, and offset.0) by only looking at the previous rows. The example above uses the expected values, but assume they are incorrect. How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables? I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I supposed I could just write a loop around that, but it seems inefficient to me.

