Search Results

Search found 4 results on 1 pages for 'cuckoo'.

Page 1/1 | 1

PyPy -- How can it possible beat CPython?

- by Vulcan Eager

From the Google Open Source Blog: PyPy is a reimplementation of Python in Python, using advanced techniques to try to attain better performance than CPython. Many years of hard work have finally paid off. Our speed results often beat CPython, ranging from being slightly slower, to speedups of up to 2x on real application code, to speedups of up to 10x on small benchmarks. How is this possible? Which Python implementation was used to implement PyPy? CPython? And what are the chances of a PyPyPy or PyPyPyPy beating their score? (On a related note... why would anyone try something like this?)

Read the article
PyPy -- How can it possibly beat CPython?

- by Vulcan Eager

From the Google Open Source Blog: PyPy is a reimplementation of Python in Python, using advanced techniques to try to attain better performance than CPython. Many years of hard work have finally paid off. Our speed results often beat CPython, ranging from being slightly slower, to speedups of up to 2x on real application code, to speedups of up to 10x on small benchmarks. How is this possible? Which Python implementation was used to implement PyPy? CPython? And what are the chances of a PyPyPy or PyPyPyPy beating their score? (On a related note... why would anyone try something like this?)

Read the article
What Counts For a DBA: Simplicity

- by Louis Davidson

Too many computer processes do an apparently simple task in a bizarrely complex way. They remind me of this strip by one of my favorite artists: Rube Goldberg. In order to keep the boss from knowing one was late, a process is devised whereby the cuckoo clock kisses a live cuckoo bird, who then pulls a string, which triggers a hat flinging, which in turn lands on a rod that removes a typewriter cover…and so on. We rely on creating automated processes to keep on top of tasks. DBAs have a lot of tasks to perform: backups, performance tuning, data movement, system monitoring, and of course, avoiding being noticed. Every day, there are many steps to perform to maintain the database infrastructure, including: checking physical structures, re-indexing tables where needed, backing up the databases, checking those backups, running the ETL, and preparing the daily reports and yes, all of these processes have to complete before you can call it a day, and probably before many others have started that same day. Some of these tasks are just naturally complicated on their own. Other tasks become complicated because the database architecture is excessively rigid, and we often discover during “production testing” that certain processes need to be changed because the written requirements barely resembled the actual customer requirements. Then, with no time to change that rigid structure, we are forced to heap layer upon layer of code onto the problematic processes. Instead of a slight table change and a new index, we end up with 4 new ETL processes, 20 temp tables, 30 extra queries, and 1000 lines of SQL code. Report writers then need to build reports and make magical numbers appear from those toxic data structures that are overly complex and probably filled with inconsistent data. What starts out as a collection of fairly simple tasks turns into a Goldbergian nightmare of daily processes that are likely to cause your dinner to be interrupted by the smartphone doing the vibration dance that signifies trouble at the mill. So what to do? Well, if it is at all possible, simplify the problem by either going into the code and refactoring the complex code to simple, or taking all of the processes and simplifying them into small, independent, easily-tested steps. The former approach usually requires an agreement on changing underlying structures that requires countless mind-numbing meetings; while the latter can generally be done to any complex process without the same frustration or anger, though it will still leave you with lots of steps to complete, the ability to test each step independently will definitely increase the quality of the overall process (and with each step reporting status back, finding an actual problem within the process will be definitely less unpleasant.) We all know the principle behind simplifying a sequence of processes because we learned it in math classes in our early years of attending school, starting with elementary school. In my 4 years (ok, 9 years) of undergraduate work, I remember pretty much one thing from my many math classes that I apply daily to my career as a data architect, data programmer, and as an occasional indentured DBA: “show your work”. This process of showing your work was my first lesson in simplification. Each step in the process was in fact, far simpler than the entire process. When you were working an equation that took both sides of 4 sheets of paper, showing your work was important because the teacher could see every step, judge it, and mark it accordingly. So often I would make an error in the first few lines of a problem which meant that the rest of the work was actually moving me closer to a very wrong answer, no matter how correct the math was in the subsequent steps. Yet, when I got my grade back, I would sometimes be pleasantly surprised. I passed, yet missed every problem on the test. But why? While I got the fact that 1+1=2 wrong in every problem, the teacher could see that I was using the right process. In a computer process, the process is very similar. We take complex processes, show our work by storing intermediate values, and test each step independently. When a process has 100 steps, each step becomes a simple step that is tested and verified, such that there will be 100 places where data is stored, validated, and can be checked off as complete. If you get step 1 of 100 wrong, you can fix it and be confident (that if you did your job of testing the other steps better than the one you had to repair,) that the rest of the process works. If you have 100 steps, and store the state of the process exactly once, the resulting testable chunk of code will be far more complex and finding the error will require checking all 100 steps as one, and usually it would be easier to find a specific needle in a stack of similarly shaped needles. The goal is to strive for simplicity either in the solution, or at least by simplifying every process down to as many, independent, testable, simple tasks as possible. For the tasks that really can’t be done completely independently, minimally take those tasks and break them down into simpler steps that can be tested independently. Like working out division problems longhand, have each step of the larger problem verified and tested.

Read the article
rm on a directory with millions of files

- by BMDan

Background: physical server, about two years old, 7200-RPM SATA drives connected to a 3Ware RAID card, ext3 FS mounted noatime and data=ordered, not under crazy load, kernel 2.6.18-92.1.22.el5, uptime 545 days. Directory doesn't contain any subdirectories, just millions of small (~100 byte) files, with some larger (a few KB) ones. We have a server that has gone a bit cuckoo over the course of the last few months, but we only noticed it the other day when it started being unable to write to a directory due to it containing too many files. Specifically, it started throwing this error in /var/log/messages: ext3_dx_add_entry: Directory index full! The disk in question has plenty of inodes remaining: Filesystem Inodes IUsed IFree IUse% Mounted on /dev/sda3 60719104 3465660 57253444 6% / So I'm guessing that means we hit the limit of how many entries can be in the directory file itself. No idea how many files that would be, but it can't be more, as you can see, than three million or so. Not that that's good, mind you! But that's part one of my question: exactly what is that upper limit? Is it tunable? Before I get yelled at--I want to tune it down; this enormous directory caused all sorts of issues. Anyway, we tracked down the issue in the code that was generating all of those files, and we've corrected it. Now I'm stuck with deleting the directory. A few options here: rm -rf (dir)I tried this first. I gave up and killed it after it had run for a day and a half without any discernible impact. unlink(2) on the directory: Definitely worth consideration, but the question is whether it'd be faster to delete the files inside the directory via fsck than to delete via unlink(2). That is, one way or another, I've got to mark those inodes as unused. This assumes, of course, that I can tell fsck not to drop entries to the files in /lost+found; otherwise, I've just moved my problem. In addition to all the other concerns, after reading about this a bit more, it turns out I'd probably have to call some internal FS functions, as none of the unlink(2) variants I can find would allow me to just blithely delete a directory with entries in it. Pooh. while [ true ]; do ls -Uf | head -n 10000 | xargs rm -f 2/dev/null; done ) This is actually the shortened version; the real one I'm running, which just adds some progress-reporting and a clean stop when we run out of files to delete, is: export i=0; time ( while [ true ]; do ls -Uf | head -n 3 | grep -qF '.png' || break; ls -Uf | head -n 10000 | xargs rm -f 2/dev/null; export i=$(($i+10000)); echo "$i..."; done ) This seems to be working rather well. As I write this, it's deleted 260,000 files in the past thirty minutes or so. Now, for the questions: As mentioned above, is the per-directory entry limit tunable? Why did it take "real 7m9.561s / user 0m0.001s / sys 0m0.001s" to delete a single file which was the first one in the list returned by "ls -U", and it took perhaps ten minutes to delete the first 10,000 entries with the command in #3, but now it's hauling along quite happily? For that matter, it deleted 260,000 in about thirty minutes, but it's now taken another fifteen minutes to delete 60,000 more. Why the huge swings in speed? Is there a better way to do this sort of thing? Not store millions of files in a directory; I know that's silly, and it wouldn't have happened on my watch. Googling the problem and looking through SF and SO offers a lot of variations on "find" that obviously have the wrong idea; it's not going to be faster than my approach for several self-evident reasons. But does the delete-via-fsck idea have any legs? Or something else entirely? I'm eager to hear out-of-the-box (or inside-the-not-well-known-box) thinking. Thanks for reading the small novel; feel free to ask questions and I'll be sure to respond. I'll also update the question with the final number of files and how long the delete script ran once I have that. Final script output!: 2970000... 2980000... 2990000... 3000000... 3010000... real 253m59.331s user 0m6.061s sys 5m4.019s So, three million files deleted in a bit over four hours.

Read the article

1