pcap segment violation

Discussion:

Daniel H. Bahr

2014-02-12 19:42:56 UTC

Hello everyone, I really hope this is the right place to post this,
since I did not find a pcap-specific mailing list. If not, could
someone point me in the right direction?

So this is the thing:
I'm working on a fairly simple passive network monitor with simple
algorithms that depend mostly on packet capture time and such. I didn't
want to use the existing Java bindings since I only needed a fairly
small subset of the jnetpcap features and thought myself capable of
recreating those I required.
So I used JNI to add a library (code after message) for capturing
packets and notifying the Java application of them.

Now it has been behaving erratically: after capturing some number of
packets it crashes with a SIGSEGV.

For instance, 1999 packets the first time, 5178 packets the second time,
and so on (a random number of packets before the SIGSEGV is raised). All
of these different "crash times" happened in the exact same
environment: two VirtualBox machines, one running a dummy service with
an open server socket and the other running a dummy app that connects
to that socket and sends short strings back and forth. The monitor runs
alongside the dummy app and monitors only packets arriving from the
dummy service.

Does this make sense to anyone? I am posting the relevant parts of
the C code in the hope that I am doing something really stupid that
someone can spot at a glance.

Best regards and thanks in advance for taking the time to read this.

================================================================

/*
* net_provinfor_middleware_facility_NetworkInfo_LibPKTInfo_NetworkMonitor.c
*
* Created on: Oct 23, 2013
* Author: D.H. Bahr
*/

#include "LibPKTInfo_PKTMonitor.h"
#include "definition.h"

pcap_t* pd;
int linkhdrlen;
JNIEnv *jni;
jobject this;

int bailed = 0;

JNIEXPORT void JNICALL Java_LibPKTInfo_PKTMonitor_startSniffing
(JNIEnv *env, jobject thisObject, jstring iface, jstring filter) {
jni = env;
this = thisObject;

int packets = -1;

const char *device = (*env)->GetStringUTFChars(env, iface, NULL);
const char *bpfstr = (*env)->GetStringUTFChars(env, filter, NULL);

// Starting pcap socket opening
char errbuf[PCAP_ERRBUF_SIZE];
uint32_t srcip, netmask;
struct bpf_program bpf;

// If no network interface (device) is specified, get the first one.
if (!*device && !(device = pcap_lookupdev(errbuf)))
{
printf("pcap_lookupdev(): %s\n", errbuf);
return;
}

// Get network device source IP address and netmask.
if (pcap_lookupnet(device, &srcip, &netmask, errbuf) < 0)
{
printf("pcap_lookupnet: %s\n", errbuf);
return;
}

// Open the device for live capture, as opposed to reading a packet
// capture file.
if ((pd = pcap_open_live(device, BUFSIZ, 1, 0, errbuf)) == NULL)
{
printf("pcap_open_live(): %s\n", errbuf);
return;
}

// Convert the packet filter expression into a packet
// filter binary.
if (pcap_compile(pd, &bpf, (char*)bpfstr, 0, netmask))
{
printf("pcap_compile(): %s\n", pcap_geterr(pd));
return;
}

// Assign the packet filter to the given libpcap socket.
if (pcap_setfilter(pd, &bpf) < 0)
{
printf("pcap_setfilter(): %s\n", pcap_geterr(pd));
return;
}

signal (SIGTERM, bailout);
signal (SIGQUIT, bailout);
signal (SIGSEGV, bailout);

capture_loop (pd, packets, (pcap_handler)parse_packet);
bailout (0);

}

JNIEXPORT void JNICALL Java_LibPKTInfo_PKTMonitor_stopSniffing
(JNIEnv *env, jobject thisObject) {
bailout(0);
}

void capture_loop(pcap_t* pd, int packets, pcap_handler func) {
int linktype;

// Determine the datalink layer type.
if ((linktype = pcap_datalink(pd)) < 0) {
printf("pcap_datalink(): %s\n", pcap_geterr(pd));
return;
}

// Set the datalink layer header size.
switch (linktype) {
case DLT_NULL:
linkhdrlen = 4;
break;

case DLT_EN10MB:
linkhdrlen = 14;
break;

case DLT_SLIP:
case DLT_PPP:
linkhdrlen = 24;
break;

default:
return;
}

// Start capturing packets.
if (pcap_loop(pd, packets, func, 0) < 0)
printf("pcap_loop failed: %s\n", pcap_geterr(pd));
}

void bailout(int signo) {
printf ("[libpktinfo.so] bailout caught with signal: %d\n", signo);
printf ("[libpktinfo.so] signal legend:\n");
printf ("[libpktinfo.so] \tINT:%d\n", SIGINT);
printf ("[libpktinfo.so] \tTERM:%d\n", SIGTERM);
printf ("[libpktinfo.so] \tQUIT:%d\n", SIGQUIT);
printf ("[libpktinfo.so] \tSEGV:%d\n", SIGSEGV);
if (bailed == 0) {
bailed = 1;
printf ("[libpktinfo.so] bailing out...");
pcap_close(pd);
printf ("done!\n");
} else {
printf ("[libpktinfo.so] already bailed, skipping...\n");
}
}

void parse_packet(u_char *user, struct pcap_pkthdr *packethdr,
u_char *packetptr) {
struct ip* iphdr;
char iphdrInfo[256], srcip[256], dstip[256];
unsigned short id, seq;

jclass klass = (*jni)->GetObjectClass(jni, this);
jmethodID midCallBack = (*jni)->GetMethodID(jni, klass, "newPacketArrived",
"(<signature here>)V");
if (NULL == midCallBack) {
return;
}

// Skip the datalink layer header and get the IP header fields.
packetptr += linkhdrlen;
iphdr = (struct ip*)packetptr;

packetptr += 4*iphdr->ip_hl;
int ID = ntohs(iphdr->ip_id);

...

(*jni)->CallVoidMethod(jni, this, midCallBack, ID, <params here>);
}

Guy Harris

2014-02-12 21:55:43 UTC

Post by Daniel H. Bahr
Hello everyone, I really hope this is the right place to post this
since I did not find a pcap-specific mailing list,

There aren't separate tcpdump or libpcap mailing lists. This list is for all tcpdump-related and libpcap-related mail, regardless of whether they're about using tcpdump/libpcap (you're in the "using libpcap" group) or developing tcpdump/libpcap.

Post by Daniel H. Bahr
Now it has been displaying an erratic behaviour in which after a
number of packets captured it would crash with a SIGSEGV.

On what OS is this? (I ask because not all OSes use SIGSEGV and SIGBUS for the same purposes, and because, with some versions of some OSes and libpcap, the packet data handed to your callback will, I think, be write-protected - see below.)

What type of processor is this running on? (I ask because there might be alignment issues on some processors, although I'd expect a SIGBUS, rather than a SIGSEGV, for alignment issues.)

Can you get a stack trace to see where the SIGSEGV is happening?

Post by Daniel H. Bahr
capture_loop (pd, packets, (pcap_handler)parse_packet);

The cast to (pcap_handler) shouldn't be necessary...

Post by Daniel H. Bahr
void parse_packet(u_char *user, struct pcap_pkthdr *packethdr,
u_char *packetptr) {

...and *won't* be necessary if you change that to

void parse_packet(u_char *user, const struct pcap_pkthdr *packethdr,
const u_char *packetptr) {

If your code fails to compile after you make that change, your code may be incorrect; libpcap does not guarantee that you will be able to modify either the struct pcap_pkthdr handed to you or the raw packet data handed to you, and, if you try to do so, Bad Things could happen (including a crash, if that data is in a region of your address space not writable by you).
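
To make the point about read-only packet data concrete, here is a minimal sketch, not taken from the original program, of a helper that reads the IP identification field through a const pointer and never writes to the buffer. The name read_ip_id() and its parameters are hypothetical; inside a real callback, pkt would be the packet pointer and caplen the caplen field of the struct pcap_pkthdr. The sketch also checks the captured length before reading and assembles the field byte by byte, which avoids unaligned loads on processors that care about alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the IPv4 identification field in host
 * byte order, or -1 if the captured bytes are too short or not IPv4.
 * The packet is only read, never modified. */
int read_ip_id(const unsigned char *pkt, size_t caplen, size_t linkhdrlen)
{
    const unsigned char *ip;

    /* Need the link-layer header plus a minimal 20-byte IPv4 header. */
    if (caplen < linkhdrlen || caplen - linkhdrlen < 20)
        return -1;
    ip = pkt + linkhdrlen;

    /* The high nibble of the first byte is the IP version. */
    if ((ip[0] >> 4) != 4)
        return -1;

    /* Bytes 4 and 5 hold the identification field in network order. */
    return (int)(((uint16_t)ip[4] << 8) | ip[5]);
}
```

With the const-qualified callback signature shown above, the packet pointer can be passed to such a helper without any cast.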

Guy Harris

2014-02-12 22:21:40 UTC

Note also that there is *NO* guarantee that the struct pcap_pkthdr or packet data pointers handed to your callback remain valid after it returns, so those pointers must not be saved by your callback or anything it calls.
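
If a packet does need to outlive the callback, its bytes have to be copied before the callback returns. A minimal sketch, assuming a hypothetical helper copy_packet() (not part of libpcap) that is called from inside the callback with the packet pointer and the header's caplen:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicates caplen bytes of packet data so the
 * copy stays valid after the capture callback has returned.
 * Returns NULL on allocation failure; the caller frees the result. */
unsigned char *copy_packet(const unsigned char *data, size_t caplen)
{
    /* Allocate at least one byte so a zero-length packet still yields
     * a non-NULL pointer that can be passed to free(). */
    unsigned char *copy = malloc(caplen ? caplen : 1);

    if (copy != NULL && caplen > 0)
        memcpy(copy, data, caplen);
    return copy;
}
```

The copy belongs to the caller, so whatever consumes it later is responsible for calling free().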

Daniel H. Bahr

2014-02-13 13:11:01 UTC

First of all thanks for your replies and sorry for the delay on mine.

"On what OS is this?"
Both Debian 7 and Ubuntu 12.04 (though I suspect they amount to
the same thing for you).

"What type of processor is this running on?"
64-bit (amd64) processors. The Ubuntu box runs on a quad-core (I am
not sure about the Debian one, but it should be something similar).

"Can you get a stack trace to see where the SIGSEGV is happening?"
I'll unplug the SIGSEGV handler and see if I get a traceback as soon
as I get to the office. I'll also take your suggestions into account
and see what happens.

"Maybe it is related to the fact that you are accessing a global
reference to a Java object across JNI calls."
I'll check that out as well and let you know ...

Best regards, and thanks again for your concern.

Post by Esteban Pellegrino
Maybe it is related to the fact that you are accessing a global reference to a
Java object across JNI calls. I'm talking about the "jobject this" defined
at the beginning of the code.
I think the garbage collector is allowed to move it. Does that make sense?
Maybe you can try your code after creating a global reference to it with the
env->NewGlobalRef method.

_______________________________________________
tcpdump-workers mailing list
https://lists.sandelman.ca/mailman/listinfo/tcpdump-workers

Daniel H. Bahr

2014-02-13 15:21:15 UTC

This keeps getting weirder...

I just unplugged the SIGSEGV handler to get a stack trace when the
signal occurs, and I've run 3 complete simulation cycles (that is,
20000 packets) without seeing any buggy behavior.

Is it at all possible that the segmentation violation signal that
triggered the bailout was emitted from a process other than this one?

2014-02-12 18:21 GMT-05:00, Esteban Pellegrino

Guy Harris

2014-02-13 18:08:21 UTC

Post by Daniel H. Bahr
This keeps getting weirder...
Just unplugged the SIGSEGV signal to get a stacktrace upon its
occurrence and I've performed 3 complete cycle (that is 20000 packets)
simulations without getting any buggy behavior.
Is it at all possible that the Segment Violation signal that triggered
the bailout was emitted from a process other than this one?

It's not possible that a SIGSEGV that causes a process to crash was emitted from a process other than the one that crashes - that's not how UN*Xes work.

It is not impossible that a process could send another process a SIGSEGV with a kill() call, but it's sufficiently unlikely that I really don't think that's what's happening.

It could be that this is a threading issue. As Michael noted, libpcap is not thread-safe; it can be run safely in a multi-threaded process only if

1) you don't do pcap_compile() simultaneously in multiple threads

and

2) you don't do operations on any particular pcap_t * simultaneously in multiple threads (you can do operations on one pcap_t * - except for pcap_compile() - in one thread and operations on another pcap_t * - again, except for pcap_compile() - in another thread, but you can't do operations on any individual pcap_t * simultaneously in more than one thread).
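
Rule 1 can be enforced with a process-wide lock. A minimal sketch, assuming POSIX threads; the wrapper call_serialized() and the stand-in function bump() are hypothetical, and in a real program the function passed in would be the one that calls pcap_compile():

```c
#include <pthread.h>

/* One lock shared by every thread in the process. */
static pthread_mutex_t compile_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: runs fn(arg) while holding the lock, so that
 * no two threads are inside fn at the same time. Returns fn's result. */
int call_serialized(int (*fn)(void *), void *arg)
{
    int rc;

    pthread_mutex_lock(&compile_lock);
    rc = fn(arg);
    pthread_mutex_unlock(&compile_lock);
    return rc;
}

/* Stand-in for the work to be serialized: increments a counter and
 * returns its new value. */
int bump(void *counter)
{
    return ++*(int *)counter;
}
```

Rule 2 needs no lock as long as each pcap_t * is only ever touched by the thread that opened it.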

I don't know whether you can run Java inside a debugger, but, if you can, try putting the "catch SIGSEGV" code back and run your Java program inside a debugger (if necessary, use the right debugger commands to make sure the SIGSEGV is first handled by the debugger, and that the debugger stops the process so that you can get a stack trace even though the program is catching it).

Michael Richardson

2014-02-13 17:29:30 UTC

The other thought I have is that java is heavily threaded, while libpcap is
not thread safe. pcap_loop() is going to block.
I see that your jni variable is a global... I wonder about that.

--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] ***@sandelman.ca http://www.sandelman.ca/ | ruby on rails [

Daniel H. Bahr

2014-02-13 18:11:52 UTC

I see what you mean, but the native startSniffing method is invoked
from a nested inner Thread. That is:

Java Main Thread {
do stuff...
Nested Outer Thread {
do more stuff...
Nested Inner Thread {
startSniffing here...
}
}
}

Daniel H. Bahr

2014-02-13 18:23:01 UTC

Guy,

my previous reply was sent before I saw your last message.

There IS a chance that more than one instance of the object owning the
native methods would be created, if there were a need to sniff on
several network interfaces simultaneously; in that case there would
be a single instance of the class for each network interface to be
sniffed.

Could this raise the issues you mention above?

Daniel H. Bahr

2014-02-13 18:24:09 UTC

It is probably worth noting that every time I saw the buggy
behavior there was only one sniffer up and running...

Guy Harris

2014-02-13 18:59:53 UTC

Michael Richardson

2014-02-13 21:16:16 UTC

Post by Daniel H. Bahr
my previous reply was sent before I saw your last message.
There IS a chance more than one instance of the Object owning the
native methods would be created IF there would be need to sniff at
several network interfaces simultaneously; in which case there would
be a single instance of the class for each network interface to be
sniffed.
Could this raise the issues you mention above?

You have only one *jni variable; that might be a problem if it
differs between threads/contexts.

Daniel H. Bahr

2014-02-13 21:24:09 UTC

Well, I tried to debug the thing from Eclipse but the crash could not
be caught, so I couldn't get the stack trace. I'll try to do that
again later.

For some reason, as I said earlier, if the SIGSEGV is not connected to
the bailout nothing queer happens; I've run some large simulations and
everything works out fine, as opposed to when I connect the signal, in
which case it crashes every time.

I'll also take a look at the "singleton" *jni and see how to work around that.

Best regards to everyone

Guy Harris

2014-02-13 21:36:09 UTC

Post by Daniel H. Bahr
For some reason, as I said earlier, if the SIGSEGV is not connected to
the bailout nothing queer happens,

Even if you leave SIGQUIT and SIGTERM connected?

Daniel H. Bahr

2014-02-14 13:26:19 UTC

Yes, even so
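
One possible explanation, which nobody in the thread confirms: the HotSpot JVM installs its own SIGSEGV handler and relies on it during normal operation, so calling signal(SIGSEGV, bailout) from native code replaces that handler and turns signals the JVM would have dealt with itself into calls to bailout(). A minimal sketch, assuming POSIX sigaction(), of a handler that records the signal and then forwards it to whatever handler was installed before, rather than discarding it. The names install_chained(), chained_handler(), prev_action and seen_signal are hypothetical, and the sketch keeps a single saved disposition, so it supports only one signal number at a time.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Disposition that was in place before ours (hypothetical names). */
static struct sigaction prev_action;
/* Last signal number seen; sig_atomic_t is safe to write in a handler. */
static volatile sig_atomic_t seen_signal = 0;

/* Records the signal, then hands it to the previous handler, if any. */
static void chained_handler(int signo, siginfo_t *info, void *ctx)
{
    seen_signal = signo;
    if (prev_action.sa_flags & SA_SIGINFO) {
        if (prev_action.sa_sigaction != NULL)
            prev_action.sa_sigaction(signo, info, ctx);
    } else if (prev_action.sa_handler != SIG_DFL &&
               prev_action.sa_handler != SIG_IGN) {
        prev_action.sa_handler(signo);
    }
}

/* Installs chained_handler for signo and remembers the old disposition.
 * Returns 0 on success, -1 on failure (as sigaction() does). */
int install_chained(int signo)
{
    struct sigaction sa;

    sa.sa_sigaction = chained_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(signo, &sa, &prev_action);
}
```

The handler itself only sets a flag; anything heavier, such as printing or closing the capture handle, is left to ordinary code that checks the flag afterwards.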
