Multi Core Testing 1

Come 2009, we will almost certainly see quad-core processors with two hardware threads per core in common use. This gives us eight execution threads. It would be almost criminal if a CPU-hungry system like Trisul (and also Snort, Ntop, etc.) ignored the seven other available threads.

The initial release of Trisul is multithreaded, but the implementation is rather naive and traditional. The idea is to refine the system continuously based on feedback from testing with the Intel Thread Profiler.

On this page, we look at how the initial release of Trisul (Rel 0.4.40) behaves when analyzing a large capture file.

Conclusion: This test shows that, despite multithreading, Trisul exhibits strongly serial performance. To fix this, we can either break the work up into more threads that carefully share almost equal chunks of work, or move to a “work stealing” task-based framework like Intel Threading Building Blocks.

Trisul Threads

The threading model of Trisul is very simple.

The two threads on the main processing path are the Packet Capture Thread and the Ring Manager Thread (see RingCaptureMgr.cpp and SniffMgr.cpp in the Trisul source code).

Num | Thread Name | What it does | Comments
1 | Packet Capture (KI_ThreadPktCapture) | Acquires packets from a file or network interface and sticks it in a circular buffer+queue |
2 | Ring Manager (KI_ThreadRingManager) | Reads packets off the buffer+queue. It runs the packet through (1) protocol tree building (2) metering (3) session tracking (4) saving raw packets and (5) pruning memory by reducing meters to the SQLITE database | As you can see this thread does almost all the heavy lifting.
3 | TRP Server | Trisul Remote Protocol server. This thread is mostly idle until a Trisul Remote Protocol client (such as Unsniff 2.0) connects to it. It then spawns a worker thread to interact with the client via TLS. | Mostly idle
4 | SQLITE3 threads | For some reason SQLITE3 creates two threads threadLockingTest1 and threadLockingTest2 (see image below). They are idle. | We can ignore these idle threads
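
The hand-off between threads 1 and 2 in the table above can be pictured with a small sketch. This is not Trisul's code: the `Packet` struct, the `PacketRing` class, and `run_two_thread_pipeline` are hypothetical names, and the sketch assumes a mutex plus two condition variables guard the circular buffer, which may differ from what RingCaptureMgr.cpp actually does.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a captured packet; not a Trisul type.
struct Packet {
    int id;
};

// Fixed-size circular buffer shared by one producer and one consumer.
class PacketRing {
public:
    explicit PacketRing(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Called by the capture thread; blocks while the ring is full.
    void push(const Packet& p) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return count_ < slots_.size(); });
        slots_[(head_ + count_) % slots_.size()] = p;
        ++count_;
        not_empty_.notify_one();
    }

    // Called by the ring manager thread; blocks while the ring is empty.
    Packet pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return count_ > 0; });
        Packet p = slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        --count_;
        not_full_.notify_one();
        return p;
    }

private:
    std::vector<Packet> slots_;
    std::size_t head_ = 0;   // index of the oldest queued packet
    std::size_t count_ = 0;  // number of packets currently queued
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Runs one capture thread and one processing thread over `total` packets
// and returns the sum of the packet ids seen by the consumer.
long run_two_thread_pipeline(int total, std::size_t ring_capacity) {
    PacketRing ring(ring_capacity);
    long sum = 0;

    std::thread capture([&] {
        for (int i = 0; i < total; ++i) ring.push(Packet{i});
    });
    std::thread manager([&] {
        for (int i = 0; i < total; ++i) sum += ring.pop().id;
    });

    capture.join();
    manager.join();
    return sum;
}
```

Every push and pop takes the same lock, so the two threads synchronize once per packet. With a cheap producer and an expensive consumer, the producer spends most of its time blocked in `push` waiting for a free slot.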

Running the sample test

The test consists of running Trisul on a large capture file containing about 300 MB of real traffic, roughly an hour's worth of data.

To run the test:

  • Change to the /usr/local/bin directory
  • First, get the command line help
[vivek@localhost bin]$ ./trisul --help
Usage: trisul [--version] | [-demon|-nodemon]  /path/to/config/file [-offline <capfile>]
[vivek@localhost bin]$ 
  • We want the offline command line option. We also want to time the run using the time shell command.
[vivek@localhost bin]$ time ./trisul -nodemon /usr/local/etc/trisul/trisulConfig.xml -offline /home/vivek/tdata/lbl1.pcap 
real    0m44.109s
user    0m21.111s
sys     0m4.992s

We ran the above test using the Intel Thread Profiler Collector for Linux.
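
The timings above already hint at the result: user plus sys time is well below the elapsed time. A minimal sketch of that arithmetic follows, assuming wall-clock time is a fair denominator (it also includes any time spent blocked on disk I/O). The function name is illustrative only.

```cpp
#include <cassert>

// Average number of cores kept busy over the run:
// total CPU time (user + sys) divided by elapsed wall-clock time.
double average_busy_cores(double real_s, double user_s, double sys_s) {
    assert(real_s > 0.0);
    return (user_s + sys_s) / real_s;
}
```

With the figures from the run above, `average_busy_cores(44.109, 21.111, 4.992)` comes to roughly 0.59, so on average less than one of the two cores was busy.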

Results

We used the Intel Thread Profiler on Windows with the remote Linux Data Collector to analyze the results.

See the following links for more information about interpreting the Intel(R) Thread Profiler results

Trisul was run on a dual core AMD 64 running Fedora 7 with 1GB RAM.

We look at Concurrency Level and Thread Activity in this test.

Concurrency level

The tool uses five concurrency levels to classify the level of parallel activity.

  • CL:0 - Idle
  • CL:1 - Serial (only one thread is active)
  • CL:2 - Undersubscribed
  • CL:3 - Parallel (the ideal goal)
  • CL:4 - Oversubscribed
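
The five levels listed above can be sketched as a small classifier. The thresholds here are an assumption based on the usual reading of these terms (active threads compared against the number of cores), not something taken from the tool's documentation, and the names are illustrative.

```cpp
#include <cassert>

enum class ConcurrencyLevel { Idle, Serial, Undersubscribed, Parallel, Oversubscribed };

// Classifies one instant of the run by how many threads are active
// relative to the number of cores (assumed thresholds, see lead-in).
ConcurrencyLevel classify(int active_threads, int cores) {
    assert(cores > 0 && active_threads >= 0);
    if (active_threads == 0) return ConcurrencyLevel::Idle;
    if (active_threads == 1) return ConcurrencyLevel::Serial;
    if (active_threads < cores) return ConcurrencyLevel::Undersubscribed;
    if (active_threads == cores) return ConcurrencyLevel::Parallel;
    return ConcurrencyLevel::Oversubscribed;
}
```

Under this reading, a dual core machine can never land in CL:2, because no thread count lies strictly between one and two.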

The chart shows histograms for the time spent by the application at each concurrency level.

Looking at our results, we find that almost all our time is spent in CL:1. This tells us that Trisul is essentially a serial process despite having two threads. This is not surprising considering the work imbalance between the two threads.

As we move forward, our goal should be to get the largest possible bar at CL:3.

Thread activity

The following picture breaks down thread activity. Here is a quick explanation.

Gray areas

The top and bottom gray bars indicate idle time. This means “Thread 1” and “Thread KI_ThreadPulseServer” are not working at all. This is not surprising, because “Thread 1” is the main application thread, which just wakes up once every few minutes to provide a metronome service. “Thread KI_ThreadPulseServer” is idle because its only job is to service remote clients via the TRP.

White areas

The two white bars (“2:threadLockingTest” and “3:threadLockingTest”) are threads that are initialized but never run. These threads originate from the SQLITE3 software. We can ignore them as if they don't exist at all.

Action

We see that most of the action is at “4:KI_ThreadPktCapture” and “5:KI_ThreadRingManager”. The yellow lines indicate overhead (a problem area). They are caused by the extensive use of synchronization by Trisul to co-ordinate access to the incoming buffer-queue structure.

The serial bottleneck

The Thread Profiler allows you to set the primary grouping to “Threads” instead of “Concurrency Level”. If we do that, we see that KI_ThreadRingManager has a huge serial impact. It spends 88.99% of its time on the critical path (defined as the end of the run) and has a serial impact of 87.62%. There is really not much parallel activity going on.
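
To get a feel for what a serial impact of this size means, here is a rough sketch using Amdahl's law. Treating the profiler's 87.62% serial impact as the serial fraction in Amdahl's formula is an assumption made for illustration, not something the tool states. The function name is hypothetical.

```cpp
#include <cassert>

// Amdahl's law: upper bound on speedup when `serial_fraction` of the
// work cannot be parallelized and the rest scales across `workers`.
double amdahl_speedup(double serial_fraction, int workers) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0 && workers >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers);
}
```

Under that assumption, `amdahl_speedup(0.8762, 8)` is only about 1.12, so even eight execution threads would buy roughly a 12% improvement unless the serial portion shrinks.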

Zoom in to the busy area

If we zoom into the timeline view, we find that even among the two active threads, one of them spends most of its time in the gray area (idle). This is probably because of an imbalance in the quantum of work for each token (packet).

Conclusions

The initial release of Trisul exhibits strongly serial behavior even when multiple cores are available.

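Rearchitecture is required. The options are really limited here: either split the work across more threads that carefully share almost equal chunks, or take a “work stealing” approach with a task-based framework like Intel Threading Building Blocks (TBB).

The latter is the clear choice because they have smart folks who have figured out most of the hard parts.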

Packet processing is hard because of the volume of packets. Too many threads will cause thrashing and cache issues. The best way forward would be to try out the TBB approach.
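
TBB itself is not shown here. As a rough illustration of why task-based scheduling helps with uneven per-packet work, the sketch below uses only the C++ standard library: a fixed pool of workers pulls the next packet index from a shared atomic counter. This is dynamic work sharing, a simpler relative of work stealing, and not how TBB is implemented. `process_packet` and `process_all` are hypothetical names, and the per-packet work is a placeholder.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-packet work whose cost varies from packet to packet.
long process_packet(int packet_id) {
    long acc = 0;
    int rounds = 1 + packet_id % 7;  // uneven cost on purpose
    for (int r = 0; r < rounds; ++r) acc += packet_id;
    return acc;
}

// Processes packets 0..packet_count-1 with a fixed number of workers.
// Each worker claims the next unprocessed packet, so a worker that
// draws cheap packets simply claims more of them.
long process_all(int packet_count, unsigned worker_count) {
    assert(packet_count >= 0 && worker_count >= 1);
    std::atomic<int> next{0};
    std::vector<long> partial(worker_count, 0);
    std::vector<std::thread> workers;

    for (unsigned w = 0; w < worker_count; ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                int i = next.fetch_add(1);
                if (i >= packet_count) break;
                partial[w] += process_packet(i);
            }
        });
    }
    for (auto& t : workers) t.join();

    // Combine per-worker results after all workers have finished.
    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

The worker count stays fixed (ideally near the number of hardware threads), which avoids the thrashing that comes with too many threads. The sketch ignores ordering: stages such as session tracking may need packets of the same flow processed in order, which a real design would have to handle.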

 
multi-core_1.txt · Last modified: 2008/06/12 03:57 by vivek
 