This is a small one, but better to get this right from the beginning - we didn't, but are almost done fixing it. And this isn't really EDA specific - its applicable to any large scale distributed programming effort. Our problem was that various developers logged different things. Most used log4j, some didn't. Some configured it one way, others another. Some logged everything to INFO - some used DEBUG properly ... What you want is uniform logging.
We have approximately 50 different services. Each service is typically an event source and sink. Some are just event sources, others just sinks.
Our system has an admin GUI that lets you see the last 10 events (configurable). The runtime just keeps various buffers for incoming and outgoing events. When a user clicks on a particular service, he sees the incoming/outgoing events.
But logging to log files is a different issue. Obviously, this is where you go looking when there is a problem in the system.
Due to privacy / regulatory requirements, we have to be careful what we log. When there is an error, however, you want to know as much as possible.
Here are some thoughts on how to log. As I said, better to do this in development then to dedicate a large portion of a maintenance release (like we did) to getting it right:
- Write a brief doc / WIKI entry on logging policy. Get agreement on what categories to log to and when (e.g., DEBUG, INFO, ERROR)
- Service entry/exit logging
- In the original event source, generate a UUID. Put this in the event header
- Log the UUID, event type, status (if applicable), and other appropriate meta data
- If you must log the payload of the event when there is an error, scrub the private information (e.g., SSN, DOB, etc.)
- If you are using error queues, you have no reason to log the event body. Just log the stack trace, event meta data, and event history. The error message should also contain the stack trace, event meta data, and event history. If you have an Error Queue Admin tool like we do - it will protect the sensitive information
- Keep the log config files out of the artifacts (e.g., .jar) that you deploy so that if there is panic in prod, you can turn on DEBUG by editing the config