Is there a way to efficiently yield every file in a directory containing millions of files?
Posted by Josh Smeaton on Stack Overflow
Published on 2011-02-23T11:44:30Z
Indexed on 2011/02/23 23:25 UTC
I'm aware of os.listdir, but as far as I can gather, it reads all the filenames in a directory into memory and then returns the list. What I want is a way to yield a filename, work on it, and then yield the next one, without reading them all into memory.
Is there any way to do this? I worry about the case where filenames change, new files are added, or files are deleted while such a method is in use. Some iterators prevent you from modifying the collection during iteration, essentially by taking a snapshot of the state of the collection at the beginning and comparing that state on each move operation. If there is an iterator capable of yielding filenames from a path, does it raise an error if filesystem changes (adding, removing, or renaming files within the iterated directory) modify the collection?
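For reference, later Python versions added os.scandir, which returns a lazy iterator over directory entries instead of building the full list first. A minimal sketch, assuming Python 3.6 or newer (it was not available when this question was asked); the helper name iter_filenames is made up for illustration:

```python
import os

def iter_filenames(path):
    """Yield the names of regular files in `path` one at a time."""
    # os.scandir returns a lazy iterator. The with-block closes the
    # directory handle when iteration ends or the generator is closed.
    with os.scandir(path) as entries:
        for entry in entries:
            # follow_symlinks=False: symlinks to files are not counted as files
            if entry.is_file(follow_symlinks=False):
                yield entry.name
```

A caller would write something like `for name in iter_filenames("/some/dir"): ...`, where the path is a placeholder. The operating system may still buffer a batch of entries internally, but the whole directory is not held as a Python list.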
A few cases could cause the iterator to fail, depending on how it maintains state. Using S.Lott's example:
filea.txt
fileb.txt
filec.txt
The iterator yields filea.txt. During processing, filea.txt is renamed to filey.txt and fileb.txt is renamed to filez.txt. When the iterator attempts to get the next file, if it uses the filename filea.txt to find its current position in order to find the next file, and filea.txt is no longer there, what happens? It may not be able to recover its position in the collection. Similarly, if the iterator had already fetched fileb.txt when yielding filea.txt, it could look up the position of fileb.txt, fail, and produce an error.
If the iterator were instead able to maintain an index, something like dir.get_file(0), then its positional state would not be affected, but some files could be missed, as they could be moved to a position 'behind' the iterator.
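One way to probe the scenario above is to rename files partway through an iteration and see what the iterator reports. The sketch below assumes Python 3.6+ and os.scandir; the function name probe_rename_during_iteration is hypothetical. POSIX leaves it unspecified whether an entry added or removed after the directory was opened is returned, so the names seen may differ between systems and filesystems.

```python
import os
import tempfile

def probe_rename_during_iteration():
    """Rename files mid-iteration; return (names_seen, names_still_present)."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("filea.txt", "fileb.txt", "filec.txt"):
            open(os.path.join(root, name), "w").close()

        seen = []
        with os.scandir(root) as entries:
            first = next(entries)  # the iterator yields its first entry
            seen.append(first.name)
            # Rename two files while the iterator is still open. Iteration
            # order is arbitrary, so `first` is not necessarily filea.txt.
            os.rename(os.path.join(root, "filea.txt"),
                      os.path.join(root, "filey.txt"))
            os.rename(os.path.join(root, "fileb.txt"),
                      os.path.join(root, "filez.txt"))
            for entry in entries:  # keep iterating after the renames
                seen.append(entry.name)

        # Names the iterator reported that no longer exist are stale.
        present = [n for n in seen if os.path.exists(os.path.join(root, n))]
        return seen, present
```

Any name that appears in the first list but not the second was renamed after being read, so code that opens such a name should be prepared for FileNotFoundError.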
This is all theoretical, of course, since there appears to be no built-in (Python) way of iterating over the files in a directory. There are some great answers below, however, that solve the problem by using queues and notifications.
Edit:
The OS of concern is Red Hat. My use case is this:
Process A is continuously writing files to a storage location. Process B (the one I'm writing) will iterate over these files, do some processing based on the filename, and move the files to another location.
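Process B's loop could be sketched roughly as follows, again assuming Python 3.6+ and os.scandir. The names drain_directory and handle are hypothetical, and the sketch does not address how to tell whether Process A has finished writing a file.

```python
import os
import shutil

def drain_directory(src, dst, handle):
    """Process each regular file in `src` by name, then move it to `dst`.

    Returns the number of files moved.
    """
    moved = 0
    with os.scandir(src) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            handle(entry.name)  # processing based on the filename
            try:
                shutil.move(entry.path, os.path.join(dst, entry.name))
            except FileNotFoundError:
                # The file vanished between listing and moving; skip it.
                continue
            moved += 1
    return moved
```

Because Process A keeps writing, a single pass may not see files created after the directory was opened, so a function like this would presumably be called repeatedly.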
Edit:
Definition of valid:
Adjective 1. Well grounded or justifiable, pertinent.
(Sorry S.Lott, I couldn't resist).
I've edited the paragraph in question above.
© Stack Overflow or respective owner