July 7 2017
Multi-processor (MP) systems make trace logging difficult. Each processor (or core) has its own cycle counter, and nothing keeps the counters synchronized. Even worse, the power management system may scale the clock on each cpu independently, sometimes very often, so different cpus can report wildly different counts for the same real-world time span: one millisecond is 4,000,000 cycles on a core running at 4GHz but only 1,000,000 on a core throttled to 1GHz. And it is pure luck which cpu a thread happens to be running on when it requests a timestamp.
This is a particularly nasty problem, and is (I'm guessing) why even Windows 7 was still using a slow platform timer (such as the HPET, typically running at a paltry 14MHz) instead of the CPU cycle counter (claiming to run at 4GHz) for QueryPerformanceCounter().
It has been many years since I researched this problem on the x86 platform, back in the days of the Pentium II. The only real solution at that time was to lock each thread to a particular cpu, store both the cycle count and the real-time clock in the trace logs, and rely on the user to correlate cycle counts with real time in the log viewer. Trying to map cycles to time automatically was an exercise in futility.
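As a rough illustration of that approach, here is a minimal Python sketch that pins the process to one cpu and records two clocks in every entry. The names pin_to_one_cpu and make_entry are made up for this example, and time.perf_counter_ns() is only a stand-in for the raw cycle counter, which Python cannot read directly. Pinning uses os.sched_setaffinity, which is only available on some platforms (Linux, for example), so the sketch quietly skips it elsewhere.

```python
import os
import time


def pin_to_one_cpu():
    """Pin the calling process to a single cpu, if the OS allows it.

    Returns the chosen cpu number, or None when pinning is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):  # not offered on every platform
        return None
    try:
        cpu = min(os.sched_getaffinity(0))  # lowest cpu we are allowed to use
        os.sched_setaffinity(0, {cpu})
    except OSError:
        return None
    return cpu


def make_entry(event):
    """Build one trace entry that carries both clocks.

    ASSUMPTION: time.perf_counter_ns() stands in for the cycle counter.
    """
    counter = time.perf_counter_ns()  # stand-in for the cycle count
    wall_ns = time.time_ns()          # real-time clock, ns since the epoch
    return (counter, wall_ns, event)


if __name__ == "__main__":
    pin_to_one_cpu()
    log = [make_entry("start"), make_entry("stop")]
    for counter, wall_ns, event in log:
        print(counter, wall_ns, event)
```

The entry deliberately keeps both values side by side and makes no attempt to convert one into the other; that correlation is left to whoever reads the log.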
Storing both cycles and seconds almost doubles the storage required. There are simple approaches to compressing the data, such as storing only the delta from the previous entry and keeping only its significant bits (which adds overhead to each entry, because I then have to store how many bits were kept!).
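The delta idea can be sketched in a few lines of Python. This is one possible encoding and not necessarily the format a real trace log would use: the function names are invented for the example, and it keeps whole significant bytes rather than individual bits to stay simple. Each entry is a one-byte length followed by that many bytes of delta, and that length byte is exactly the per-entry overhead mentioned above.

```python
def encode_deltas(values):
    """Encode a non-decreasing sequence of ints as length-prefixed deltas."""
    out = bytearray()
    previous = 0
    for value in values:
        delta = value - previous
        if delta < 0:
            raise ValueError("values must be non-decreasing")
        nbytes = (delta.bit_length() + 7) // 8  # significant bytes (0 for a zero delta)
        out.append(nbytes)                      # per-entry overhead: the length
        out += delta.to_bytes(nbytes, "little")
        previous = value
    return bytes(out)


def decode_deltas(data):
    """Reverse encode_deltas and return the original list of ints."""
    values = []
    previous = 0
    pos = 0
    while pos < len(data):
        nbytes = data[pos]
        pos += 1
        previous += int.from_bytes(data[pos:pos + nbytes], "little")
        pos += nbytes
        values.append(previous)
    return values


if __name__ == "__main__":
    cycles = [1_000_000_000_000, 1_000_000_004_200, 1_000_000_009_000]
    packed = encode_deltas(cycles)
    print(len(packed), "bytes instead of", 8 * len(cycles))
    assert decode_deltas(packed) == cycles
```

With the three sample cycle counts, the first entry needs 6 bytes (the full starting value) and each later entry needs 3, so the packed form is 12 bytes against 24 for raw 64-bit values. The same encoder could be run separately over the cycle column and the time column.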