Necessary Infrastructure for large project with many components communicating through IPCs
- by jluzwick
I have a fairly in depth question which probably doesn't have an exact answer.
As a software engineer, I am usually tasked with working on a program or project with minimal understanding of how other components or programs in the project interact with each other. When one program fails in a sea of multiple components and processes, what infrastructure elements are necessary to ensure that the problem can be accurately tracked to the violating application?
More specifically, what infrastructure elements should be necessary for this large project and which are optional but very helpful. One such example I can think of is some form of a common logging infrastructure that allows for a developer or tester to easily browse through a log that contains numerous components for messages that might allude to the culprit program along with a "trail" of what happened before the issue occurred. I'm thinking of something similar to Androids alogcat tool.
These necessary infrastructure elements should be language-agnostic.
While these elements should be understood by all engineers on the team in question, which elements should be understood at great detail by the technical system engineers and what should the individual software engineers be responsible for adding to their tools to allow for such infrastructures to take hold?
Please feel free to ask for clarification if something does not make sense as I understand this question is very broad and needs some refinement. I will refine as necessary from the answers and comments I receive.
Thanks for any help!