Search Results

Search found 2515 results on 101 pages for 'distributed filesystems'.

Page 1/101 | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

Recommendations for distributed processing/distributed storage systems

- by Eddie

At my organization we have a processing and storage system spread across two dozen linux machines that handles over a petabyte of data. The system right now is very ad-hoc; processing automation and data management is handled by a collection of large perl programs on independent machines. I am looking at distributed processing and storage systems to make it easier to maintain, evenly distribute load and data with replication, and grow in disk space and compute power. The system needs to be able to handle millions of files, varying in size between 50 megabytes to 50 gigabytes. Once created, the files will not be appended to, only replaced completely if need be. The files need to be accessible via HTTP for customer download. Right now, processing is automated by perl scripts (that I have complete control over) which call a series of other programs (that I don't have control over because they are closed source) that essentially transforms one data set into another. No data mining happening here. Here is a quick list of things I am looking for: Reliability: These data must be accessible over HTTP about 99% of the time so I need something that does data replication across the cluster. Scalability: I want to be able to add more processing power and storage easily and rebalance the data on across the cluster. Distributed processing: Easy and automatic job scheduling and load balancing that fits with processing workflow I briefly described above. Data location awareness: Not strictly required but desirable. Since data and processing will be on the same set of nodes I would like the job scheduler to schedule jobs on or close to the node that the data is actually on to cut down on network traffic. Here is what I've looked at so far: Storage Management: GlusterFS: Looks really nice and easy to use but doesn't seem to have a way to figure out what node(s) a file actually resides on to supply as a hint to the job scheduler. GPFS: Seems like the gold standard of clustered filesystems. Meets most of my requirements except, like glusterfs, data location awareness. Ceph: Seems way to immature right now. Distributed processing: Sun Grid Engine: I have a lot of experience with this and it's relatively easy to use (once it is configured properly that is). But Oracle got its icy grip around it and it no longer seems very desirable. Both: Hadoop/HDFS: At first glance it looked like hadoop was perfect for my situation. Distributed storage and job scheduling and it was the only thing I found that would give me the data location awareness that I wanted. But I don't like the namename being a single point of failure. Also, I'm not really sure if the MapReduce paradigm fits the type of processing workflow that I have. It seems like you need to write all your software specifically for MapReduce instead of just using Hadoop as a generic job scheduler. OpenStack: I've done some reading on this but I'm having trouble deciding if it fits well with my problem or not. Does anyone have opinions or recommendations for technologies that would fit my problem well? Any suggestions or advise would be greatly appreciated. Thanks!

Read the article
Building a Redundant / Distributed Application

- by MattW

This is more of a "point me in the right direction" question. My team of three and I have built a hosted web app that queues and routes customer chat requests to available customer service agents (It does other things as well, but this is enough background to illustrate the issue). The basic dev architecture today is: a single page ajax web UI (ASP.NET MVC) with floating chat windows (think Gmail) a backend Windows service to queue and route the chat requests this service also logs the chats, calculates service levels, etc a Comet server product that routes data between the web frontend and the backend Windows service this also helps us detect which Agents are still connected (online) And our hardware architecture today is: 2 servers to host the web UI portion of the application a load balancer to route requests to the 2 different web app servers a third server to host the SQL Server DB and the backend Windows service responsible for queuing / delivering chats So as it stands today, one of the web app servers could go down and we would be ok. However, if something would happen to the SQL Server / Windows Service server we would be boned. My question - how can I make this backend Windows service logic be able to be spread across multiple machines (distributed)? The Windows service is written to accept requests from the Comet server, check for available Agents, and route the chat to those agents. How can I make this more distributed? How can I make it so that I can distribute the work of the backend Windows service can be spread across multiple machines for redundancy and uptime purposes? Will I need to re-write it with distributed computing in mind? I should also note that I am hosting all of this on Rackspace Cloud instances - so maybe it is something I should be less concerned about? Thanks in advance for any help!

Read the article
Fast distributed filesystem for a large amounts of data with metadata in database

- by undefined hero

My project uses several processing machines and one storage machine. Currently storage organized with a MSSQL filetable shared folder. Every file in storage have some metadata in database. Processing machines executes tasks for which they needed files from storage and their metadata. After completing task, processing machine puts resulting data back in storage. From there its taken by another processing machine, which also generates some file and put it back in storage. And etc. Everything was fine, but as number of processing machines increases, I found myself bottlenecked myself with storage machines hard drive performance. So I want processing machines to put files in distributed FS. to lift load from storage machines, from which they can take data from each other, not only storage machine. Can You suggest a particular distributed FS which meets my needs? Or there is another way to solve this problem, without it? Amounts of data in FS in one time are like several terabytes. (storage can handle this, but processors cannot). Data consistence is critical. Read write policy is: once file is written - its constant and may be only removed, but not modified. My current platform is Windows, but I'm ready to switch it, if there is a substantially more convenient solution on another one.

Read the article
Java - System design with distributed Queues and Locks

- by sunny

Looking for inputs to evaluate a design for a system (java) which would have a distributed queue serving several (but not too many) nodes. These nodes would process objects present in the distributed queue and on occasion require a distributed lock across the cluster on an arbitrary (distributed) data structures. These (distributed) data structures could potentially lie in a distributed cache. Eliminating Terracotta (DSO),Hazelcast and Akka what could be alternative choices. Currently considering zookeeper as a distributed locking mechanism. Since the recommendation of a znode is not to exceed the 1M size , the understanding is that zookeeper should not be used a distributed queue. And also from Netflix curator tech note 4. So should a distributed cache, say like memcached, or redis be used to emulate a distributed queue ? i.e. The distributed queue will be stored in the caches and will be locked cluster-wide via zookeeper. Are there potential pitfalls with this high-level approach. The objects don't need to be taken off the queue. The object will pass through a lifecycle which will determine its removal from the queue. There would be about 10k+ objects in a queue at a given time changing states and any node could service one stage of the object's lifecycle. (Although not strictly necessary .. i.e. one node could serve the entire lifecycle if that is more efficient.) Any suggestions/alternatives ? sidenote: new to zookeeper ; redis etc.

Read the article
Distributed storage and computing

- by Tim van Elteren

Dear Serverfault community, After researching a number of distributed file systems for deployment in a production environment with the main purpose of performing both batch and real-time distributed computing I've identified the following list as potential candidates, mainly on maturity, license and support: Ceph Lustre GlusterFS HDFS FhGFS MooseFS XtreemFS The key properties that our system should exhibit: an open source, liberally licensed, yet production ready, e.g. a mature, reliable, community and commercially supported solution; ability to run on commodity hardware, preferably be designed for it; provide high availability of the data with the most focus on reads; high scalability, so operation over multiple data centres, possibly on a global scale; removal of single points of failure with the use of replication and distribution of (meta-)data, e.g. provide fault-tolerance. The sensitivity points that were identified, and resulted in the following questions, are: transparency to the processing layer / application with respect to data locality, e.g. know where data is physically located on a server level, mainly for resource allocation and fast processing, high performance, how can this be accomplished? Do you from experience know what solutions provide this transparency and to what extent? posix compliance, or conformance, is mentioned on the wiki pages of most of the above listed solutions. The question here mainly is, how relevant is support for the posix standard? Hadoop for example isn't posix compliant by design, what are the pro's and con's? what about the difference between synchronous and asynchronous opeartion of a distributed file system. Though a synchronous distributed file system has the preference because of reliability it also imposes certain limitations with respect to scalability. What would be, from your expertise, the way to go on this? I'm looking forward to your replies. Thanks in advance! :) With kind regards, Tim van Elteren

Read the article
Question regarding filesystems true or false?

- by Avon

Hello all, though I'm familiar with stackoverflow , and loving it , i've actually got a couple of questions myself about something other then programming. Here are my question Is it true that in FAT filesystems the maximum number of files per filesystem equals the number of entries in the FAT table. And is it also true that in indexed filesystems the maximum number of files per filesystem equals the number of indexblocks – 1. I'm reading some stuff and am trying to get a good understanding of it.

Read the article
Journaled filesystems and power failure

- by Yoga

I heard that even a journaled filesystems such as EXT3/EXT4 might corrupted during power failure, e.g. from wikipedia [1]: In the event of a system crash or power failure, such file systems are quicker to bring back online and less likely to become corrupted. Can anyone provide more detail by giving examples such that when corruption can occur corruption is avoided by journaled filesystems [1] http://en.wikipedia.org/wiki/Journaling_file_system

Read the article
Elastic versus Distributed in caching.

- by Mike Reys

Until now, I hadn't heard about Elastic Caching yet. Today I read Mike Gualtieri's Blog entry. I immediately thought about Oracle Coherence and got a little scare throughout the reading. Elastic Caching is the next step after Distributed Caching. As we've always positioned Coherence as a Distributed Cache, I thought for a brief instance that Oracle had missed a new trend/technology. But then I started reading the characteristics of an Elastic Cache. Forrester definition: Software infrastructure that provides application developers with data caching services that are distributed across two or more server nodes that consistently perform as volumes grow can be scaled without downtime provide a range of fault-tolerance levels Hey wait a minute, doesn't Coherence fullfill all these requirements? Oh yes, I think it does! The next defintion in the article is about Elastic Application Platforms. This is mainly more of the same with the addition of code execution. Now there is analytics functionality in Oracle Coherence. The analytics capability provides data-centric functions like distributed aggregation, searching and sorting. Coherence also provides continuous querying and event-handling. I think that when it comes to providing an Elastic Application Platform (as in the Forrester definition), Oracle is close, nearly there. And what's more, as Elastic Platform is the next big thing towards the big C word, Oracle Coherence makes you cloud-ready ;-) There you go! Find more info on Oracle Coherence here.

Read the article
To see virtual filesystems in Ubuntu

- by Masi

How can you see the virtual filesystems at root? They should be /proc and /dev.

Read the article
Building a distributed system on Amazon Web Services

- by Songo

Would simply using AWS to build an application make this application a distributed system? For example if someone uses RDS for the database server, EC2 for the application itself and S3 for hosting user uploaded media, does that make it a distributed system? If not, then what should it be called and what is this application lacking for it to be distributed? Update Here is my take on the application to clarify my approach to building the system: The application I'm building is a social game for Facebook. I developed the application locally on a LAMP stack using Symfony2. For production I used an a single EC2 Micro instance for hosting the app itself, RDS for hosting my database, S3 for the user uploaded files and CloudFront for hosting static content. I know this may sound like a naive approach, so don't be shy to express your ideas.

Read the article
Design a Distributed System

- by Bonton255

I am preparing for an interview on Distributed Systems. I have gone through a lot of text and understand the basics of the area. However, I need some examples of discussions on designing a distributed system given a scenario. For example, if I were to design a distributed system to calculate if a number N is primary or not, what will the be design of the system, what will be the impact of network latency, CPU performance, node failure, addition of nodes, time synchronization etc. If you guys could present your in-depth thoughts on this example, or point me to some similar discussion, that would be really helpful.

Read the article
Filesystems for webserver with SATA and Solid State disk,

- by Jorisslob

We have just ordered a new webserver with 120 Gb solid state disk and a SATA disk. I am trying to plan ahead what sort of filesystem to use. This system will be running Linux, Apache/Tomcat to host java services. The main service is a system where people can upload reasonably large files (in the order of 100 Mb, images, image stacks and video), which people will be able to annotate and which will be sent to a database server when annotation is complete. Thus far, I plan to put most of the utility programs of the operating system om the SSD and put the large media files there. The SATA disks will hold the less volitile data like apache, tomcat and the servlets. For filesystems I have considered going for the stable EXT3 because I hear that it is best supported. The downside seems to be that it not the ideal choice for large files. That is why I am leaning towards using XFS for the SSD and EXT3 for the SATA. My questions are: 1) Does this sound like a reasonable setup? 2) What filesystems would you recommend for the SSD and for the SATA? Thanks

Read the article
Can an issue tracking system be distributed?

- by Klaim

I was thinking about issue tracking software like Redmine, Trac or even the one that is in Fossil and something hit me: Is there a reason why Redmine and Trac are not possible to be distributed? Or maybe it's possible and I just don't know how it's possible? If it's not possible, why? By distributed I mean like Facebook or Google or other applications that effectively runs on multiple hardware a the same time but share data.

Read the article
tag structured Filesystems

- by A.Rashad

I hope this is the correct site, I lose my way between the 4 sister sites :) Let me ask the question this way. all file systems I have seen before are hierarchical, that means a root directory, with some branched directories, and so on until we have files residing in these directories. except for AS/400 file structure, where it has a concept of a Library that serve somehow as a directory but one level only. Why not have directory-less filesystems where files are placed in a single location, but the file identifiers would be referenced by a database of tag/ file relation ships. This way there will be no need for symbolic links, one file may have multiple relations to multiple subjects, not only a single parent directory to contain. I hope the idea is clear.

Read the article
Does exportfs disrupt users already utilizing those filesystems?

- by CptSupermrkt

I need to modify a servers /etc/exports file to export to an additional host. After modifying this file, for it to take effect (i.e. for the additional host to have access to the designated filesystem), I believe I have to run "exportfs" on the server exporting the filesystem. Does this disrupt users who are currently using filesystems that are exported from that serving host? I'm hoping to add this new host "silently", without disruption. Any additional advice related to this, common traps, things to be careful of, etc. would be appreciated if you have any. Edit: just in case...uname -a returns 2.6.32-358.18.1.el6.x86_64 #1 SMP Fri Aug 2 17:04:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

Read the article
Distributed Computing Framework (.NET) - Specifically for CPU Instensive operations

- by StevenH

I am currently researching the options that are available (both Open Source and Commercial) for developing a distributed application. "A distributed system consists of multiple autonomous computers that communicate through a computer network." Wikipedia The application is focused on distributing highly cpu intensive operations (as opposed to data intensive) so I'm sure MapReduce solutions don't fit the bill. Any framework that you can recommend ( + give a brief summary of any experience or comparison to other frameworks ) would be greatly appreciated. Thanks.

Read the article
Distributed Development Tools -- (Version control and Project Management)

- by Macy Abbey

Hello, I've recently become responsible for choosing which source control and project management software to use for a company that employs me. Currently it uses Jira (project management) and Subversion (version control). I know there are many other options out there -- the ones I know about are all in this article http://mashable.com/2010/07/14/distributed-developer-teams/ . I'm leaning towards recommending they just stay with what they have as it seems workable and any change would have to be worth the cost of switching to say github/basecamp or some other solution. Some details on the team: It's a distributed development shop. Meetings of the whole team in one room are rare. It's currently a very small development team (three developers). The project management software is used by developers and a product manager or two. What are you experiences with version control and project management web applications? Are there any you would recommend and you think are worth the switching cost of time to learn new services / implementing the change? Edit: After educating myself further on the options it appears DVCS offer powerful benefits that may be worth investing in now as opposed to later in the company's lifetime when the switching cost is higher: I'm a Subversion geek, why I should consider or not consider Mercurial or Git or any other DVCS?

Read the article
Distributed Development Tools -- (Version control and Project Management)

- by Macy Abbey

I've recently become responsible for choosing which source control and project management software to use for a company that employs me. Currently it uses Jira (project management) and Subversion (version control). I know there are many other options out there -- the ones I know about are all in this article http://mashable.com/2010/07/14/distributed-developer-teams/ . I'm leaning towards recommending they just stay with what they have as it seems workable and any change would have to be worth the cost of switching to say github/basecamp or some other solution. Some details on the team: It's a distributed development shop. Meetings of the whole team in one room are rare. It's currently a very small development team (three developers). The project management software is used by developers and a product manager or two. What are you experiences with version control and project management web applications? Are there any you would recommend and you think are worth the switching cost of time to learn new services / implementing the change? Edit: After educating myself further on the options it appears DVCS offer powerful benefits that may be worth investing in now as opposed to later in the company's lifetime when the switching cost is higher: I'm a Subversion geek, why I should consider or not consider Mercurial or Git or any other DVCS?

Read the article
JavaScript distributed computing project

- by Ben L.

I made a website that does absolutely nothing, and I've proven to myself that people like to stay there - I've already logged 11+ hours worth of cumulative time on the page. My question is whether it would be possible (or practical) to use the website as a distributed computing site. My first impulse was to find out if there were any JavaScript distributed computing projects already active, so that I could put a piece of code on the page and be done. Unfortunately, all I could find was a big list of websites that thought it might be a cool idea. I'm thinking that I might want to start with something like integer factorization - in this case, RSA numbers. It would be easy for the server to check if an answer was correct (simply test for modulus equals zero), and also easy to implement. Is my idea feasible? Is there already a project out there that I can use?

Read the article
Ceph: open-source petabyte scale distributed storage

- by Gregory Burd

You might want to read this interesting research paper titled "Object Storage on Berkeley DB". The Ceph project is a distributed network file system designed to provide excellent performance, reliability, and scalability... and it uses Oracle Berkeley DB for storage.

Read the article
Synchronizing local and remote cache in distributed caching

- by ltfishie

With a distributed cache, a subset of the cache is kept locally while the rest is held remotely. In a get operation, if the entry is not available locally, the remote cache will be used and and the entry is added to local cache. In a put operation, both the local cache and remote cache are updated. Other nodes in the cluster also need to be notified to invalidate their local cache as well. What's a simplest way to achieve this if you implemented it yourself, assuming that nodes are not aware of each other. Edit My current implementation goes like this: Each cache entry contains a time stamp. Put operation will update local cache and remote cache Get operation will try local cache then remote cache A background thread on each node will check remote cache periodically for each entry in local cache. If the timestamp on remote is newer overwrite the local. If entry is not found in remote, delete it from local.

Read the article
Scalable distributed file system for blobs like images and other documents

- by Pinnacle

Cassandra & HBase both do not efficiently support storage of blobs like images. Storing directly on HDFS stresses the Namenode because of huge number of files. Facebook uses Haystack for images and attachments storage, but this is not open source. So is Lustre a good choice for distributed blob storage? I have read that Amazon S3 is used by many, but this would cost money and personally, I would not like to rely on third party system. What are other suggestions?

Read the article
alternative filesystems for SSD

- by freedrull

I am tired of watching fsck check my filesystem when my eeepc 901 shuts down abruptly due to a crash. I know that with a journaling filesystem, I won't have to wait for a check. However, I am well aware of the poor I/O performance of the SSD, so I can imagine using a journaling filesystem being even more frustrating, since there will be constant writes to the journal? I will buy a new laptop without such a crummy ssd someday but, is there anything I can do now, on the software side of things?

Read the article
the effect of large number of files on disk space in unix filesystems

- by user46976

If I have a text file in Unix that contains N-many independent entries (e.g. records about employees, where each employee has a separate record), is it expected that this file will take up less space than if I split the file into N files, each containing the entry for one employee? in other words, can one save significant space on unix file systems by concatenating many files together, or is the difference negligible? thanks.

Read the article
Join multiple filesystems (on multiple computers) into one big volume

- by jm666

Scenario: Have 10 computers, each have 12x2TB HDDs (currently) in raidZ2 (10+2) configuration, so, in the each computer i have one approx. 20TB volume. Now, need those 10 separate computers (separate raid groups) join into one big volume. What is the recommended solution? I'm thinking about the FCoE (10GB ethernet). So, buying into each computer FCoE (10GB ethernet card) and - what need more on the hardware side? (probably another computer, FCoE switch? like Cisco Nexus?) The main question is: what need to install and configure on each computer? Currently they have freebsd/raidz2, but it is possible change it into Linux/Solaris if needed. Any helpful resource what talking about how to build a big volumes from smaller raid-groups (on the software side) is very welcomed. So, what OS, what filesystem, what software - etc. In short: want get one approx. 200TB storage (in one filesystem) from already existing computers/storage. Don't need fast writes, but need good performance on reading data. (as a big fileserver), what will works transparently, so when storing data don't want care about onto what computer the data goes. (e.g. not 10 mountpoints - but one big logical filesystem). Thanks.

Read the article

Search Results

Search found 2515 results on 101 pages for 'distributed filesystems'.

Page 1/101 | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by Eddie

- by MattW

- by undefined hero

- by sunny

- by Tim van Elteren

- by Avon

- by Yoga

- by Mike Reys

- by Masi

- by Songo

- by Bonton255

- by Jorisslob

- by Klaim

- by A.Rashad

- by CptSupermrkt

- by StevenH

- by Macy Abbey

- by Macy Abbey

- by Ben L.

- by Gregory Burd

- by ltfishie

- by Pinnacle

- by freedrull

- by user46976

- by jm666

1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >