[Winpcap-users] pcap_open_offline and unicode charsets

Wed Nov 11 07:19:19 PST 2009

Thanks for your deep insight into the issue.

> he's using Windows (as per the reference to MinGW and
> jnetpcap.dll), so his problem may ultimately be caused by the
> lack of pcap_wopen_offline().

I agree.

Java's native code interface (JNI) provides support for converting Java
strings to UTF-16 or (modified) UTF-8:
http://java.sun.com/javase/6/docs/technotes/guides/jni/spec/functions.html#string_operations

So when support for this arrives in libpcap/WinPcap, I will be ready.

I can experiment some more with Windows wide-char to Java string conversion
in my getHardwareAddress function for an interface:

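		/*
		 * Name is in wide character format, so convert it to UTF-8.
		 * Code page 0 is CP_ACP (the ANSI code page), not UTF-8;
		 * CP_UTF8 is needed to actually get UTF-8 output.
		 */
		int size = WideCharToMultiByte(CP_UTF8, 0, map->Name, -1,
		    NULL, 0, NULL, NULL);
		char utf8[size + 1];	/* size already counts the NUL */
		WideCharToMultiByte(CP_UTF8, 0, map->Name, -1, utf8, size,
		    NULL, NULL);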

(Source starts on line 567:
http://jnetpcap.svn.sourceforge.net/viewvc/jnetpcap/jnetpcap/trunk/src/c/jnetpcap_utils.cpp?view=markup)
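
For a platform-independent picture of what a wide-char to UTF-8 conversion
has to do, here is a minimal sketch in plain C. The function name
utf16_to_utf8 is hypothetical (it is not part of jNetPcap, WinPcap or the
Windows API), and it assumes the input is a NUL-terminated array of 16-bit
UTF-16 code units, which is what a Windows wide string holds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: converts a NUL-terminated UTF-16 string to
 * NUL-terminated UTF-8. Returns the number of bytes written, including
 * the terminator, or 0 if the input is malformed or out is too small.
 */
size_t utf16_to_utf8(const uint16_t *in, char *out, size_t out_size)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (*in != 0) {
        uint32_t cp = *in++;
        size_t len;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (*in < 0xDC00 || *in > 0xDFFF)
                return 0;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(*in++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return 0;               /* unpaired low surrogate */
        }

        /* Number of UTF-8 bytes this code point needs. */
        len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len + 1 > out_size)
            return 0;               /* no room for bytes plus terminator */

        if (len == 1) {
            out[n++] = (char)cp;
        } else if (len == 2) {
            out[n++] = (char)(0xC0 | (cp >> 6));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            out[n++] = (char)(0xE0 | (cp >> 12));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (char)(0xF0 | (cp >> 18));
            out[n++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (cp & 0x3F));
        }
    }

    if (n + 1 > out_size)
        return 0;
    out[n++] = '\0';
    return n;
}
```

Like WideCharToMultiByte called with a length of -1, the returned count
includes the terminating NUL, so no extra byte is needed in the buffer.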

Now that I think about it, I should be able to convert directly from a
Windows wide-char string to a Java string, without the extra step of going
through UTF-8 as an intermediate between the two.

Cheers,
mark...

> -----Original Message-----
> From: Guy Harris [mailto:guy at alum.mit.edu]
> Sent: Sunday, November 08, 2009 5:28 PM
> To: voytechs at yahoo.com; winpcap-users at winpcap.org
> Subject: Re: [Winpcap-users] pcap_open_offline and unicode charsets
>
>
> On Nov 8, 2009, at 12:55 PM, Mark Bednarczyk wrote:
>
> >>> My library gets its filename from a java string and it currently
> >>> converts it to plain UTF-8 charset and that works fine.
> >>
> >> On UN*X, it should perhaps be converted to whatever the locale's
> >> filename character set is.
> >
> > But I don't actually call on any fopen calls directly. I rely on
> > libpcap to work with the filesystem. Therefore I would like to go by
> > the specs the libpcap provides for the pcap_open_offline call. It
> > would be nice to somehow handle and provide a definitive specification
> > when passing in a string.
>
> The definitive specification is "it calls fopen(), so it does
> the same thing as fopen()".
>
> *If* a file name happens to be encoded, in the file system,
> using UTF-8, you would hand that UTF-8 string to fopen() to
> open it, so you would do the same with pcap_open_offline().
> If, instead, it happens to be encoded using ISO 8859/1, or
> 8859/2, or 8859/15, or..., or KOI-8, or Shift-JIS, or EUJIS,
> or..., you'd hand a string in *that* encoding.  (Sorry, but
> UN*X internationalization antedated Unicode, so they had to
> do *something*, and ended up doing a variety of different
> things in different locales.  Oh, and don't get me started
> about Unicode normalization forms....)
>
> >
> >>
> >> I'm not sure how that would be determined, however.  I might be
> >> tempted to assume that, if the environment variable LC_CTYPE is set
> >> that specifies the encoding, otherwise if LANG is set that specifies
> >> the encoding, otherwise it might be the C locale (which, I think,
> >> unfortunately says the encoding is ASCII).  However, GLib (not glibc,
> >> GLib) has its own additional environment variables:
> >>
> >> 	http://library.gnome.org/devel/glib/stable/glib-running.html
> >>
> >> and I'm not sure why that's the case.
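
The environment-variable lookup described above could be sketched roughly
as follows. This is only an illustration: the function names are
hypothetical, and checking LC_ALL first is my addition (POSIX gives it
precedence over LC_CTYPE and LANG). It only parses the locale name, so a
name without an explicit codeset (such as "C" or "de_DE") yields NULL
rather than a guess.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: copies the codeset part of a locale name of the
 * form "language_TERRITORY.codeset@modifier" into buf.
 * Returns buf, or NULL if the name has no codeset or buf is too small.
 */
const char *locale_codeset(const char *name, char *buf, size_t buf_size)
{
    const char *dot;
    size_t len;

    assert(buf != NULL && buf_size > 0);

    if (name == NULL)
        return NULL;
    dot = strchr(name, '.');
    if (dot == NULL)
        return NULL;

    /* The codeset runs up to an optional "@modifier" suffix. */
    len = strcspn(dot + 1, "@");
    if (len == 0 || len >= buf_size)
        return NULL;

    memcpy(buf, dot + 1, len);
    buf[len] = '\0';
    return buf;
}

/*
 * Hypothetical helper: applies locale_codeset() to the first non-empty
 * variable among LC_ALL, LC_CTYPE and LANG. Returns NULL if none is set
 * or if the chosen one names no codeset.
 */
const char *guess_codeset_from_env(char *buf, size_t buf_size)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && *value != '\0')
            return locale_codeset(value, buf, buf_size);
    }
    return NULL;
}
```

On POSIX systems, calling setlocale(LC_CTYPE, "") followed by
nl_langinfo(CODESET) is the standard way to ask the C library the same
question, and it also covers locale names that carry no explicit codeset.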
> >>
> >>> But in reality I'd like to support all unicode widths 8, 16 and even
> >>> 32 bit. I'm not sure how those wider unicode chars would be handled.
> >>
> >> How are they handled elsewhere in Java?  The File class seems to work
> >> with Strings, and the String class, at least as I understand the
> >> documentation, uses UTF-16 (presumably that's what you mean by
> >> "unicode [width] ... 16 ... bit").
> >
> > Java has extensive unicode support for even the extended unicode
> > widths where they combine 2 UTF-16 chars to describe a single
> > character.
>
> If that's "surrogate pairs", that's more like "combining two
> 16-bit codes" - a surrogate pair is a single character,
> represented as two "code units":
>
> 	http://unicode.org/standard/principles.html
>
> "Encoding Forms
>
> Character encoding standards define not only the identity of
> each character and its numeric value, or code point, but also
> how this value is represented in bits.
>
> The Unicode Standard defines three encoding forms that allow
> the same data to be transmitted in a byte, word or double
> word oriented format (i.e. in 8, 16 or 32-bits per code
> unit). All three encoding forms encode the same common
> character repertoire and can be efficiently transformed into
> one another without loss of data. The Unicode Consortium
> fully endorses the use of any of these encoding forms as a
> conformant way of implementing the Unicode Standard.
>
> UTF-8 is popular for HTML and similar protocols. UTF-8 is a
> way of transforming all Unicode characters into a variable
> length encoding of bytes. It has the advantages that the
> Unicode characters corresponding to the familiar ASCII set
> have the same byte values as ASCII, and that Unicode
> characters transformed into UTF-8 can be used with much
> existing software without extensive software rewrites.
>
> UTF-16 is popular in many environments that need to balance
> efficient access to characters with economical use of
> storage. It is reasonably compact and all the heavily used
> characters fit into a single 16-bit code unit, while all
> other characters are accessible via pairs of 16-bit code units.
>
> UTF-32 is popular where memory space is no concern, but fixed
> width, single code unit access to characters is desired. Each
> Unicode character is encoded in a single 32-bit code unit
> when using UTF-32.
>
> All three encoding forms need at most 4 bytes (or 32-bits) of
> data for each character."
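
To make the "at most 4 bytes" point concrete, here is a small sketch that
counts how many code units a single code point needs in each of the three
encoding forms. The function names are my own, chosen for illustration;
the ranges are the ones the Unicode Standard defines.

```c
#include <assert.h>
#include <stdint.h>

/* True for code points that can be encoded: up to U+10FFFF, minus surrogates. */
static int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Number of 8-bit code units (bytes) the code point needs in UTF-8. */
int utf8_code_units(uint32_t cp)
{
    assert(is_scalar_value(cp));
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    if (cp < 0x10000)
        return 3;
    return 4;
}

/* Number of 16-bit code units in UTF-16: 2 means a surrogate pair. */
int utf16_code_units(uint32_t cp)
{
    assert(is_scalar_value(cp));
    return cp < 0x10000 ? 1 : 2;
}

/* UTF-32 always uses exactly one 32-bit code unit. */
int utf32_code_units(uint32_t cp)
{
    assert(is_scalar_value(cp));
    return 1;
}
```

For example, U+00FC (the u-umlaut in "Müller") takes 2 bytes in UTF-8 and
one 16-bit unit in UTF-16, while U+1F600, which lies outside the BMP, takes
4 bytes in UTF-8 and a surrogate pair (also 4 bytes) in UTF-16.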
>
> At least as I read the description of the String class:
>
> 	http://java.sun.com/javase/6/docs/api/java/lang/String.html
>
> it's based on UTF-16:
>
> "A String represents a string in the UTF-16 format in which
> supplementary characters are represented by surrogate pairs
> (see the section Unicode Character Representations in the
> Character class for more information). Index values refer to
> char code units, so a supplementary character uses two
> positions in a String.
>
> The String class provides methods for dealing with Unicode
> code points (i.e., characters), in addition to those for
> dealing with Unicode code units (i.e., char values)."
>
> > Here is how java represents unicode characters:
> >
> > The char data type (and therefore the value that a Character object
> > encapsulates) are based on the original Unicode specification, which
> > defined characters as fixed-width 16-bit entities.
>
> Meaning it can't handle characters outside the BMP.
>
> However, from your example in "Decoding packets manually":
>
> 	String file = "capturefile.pcap";
>
> 		...
>
> 	Pcap pcap = Pcap.openOffline(file, errbuf);
>
> it appears that you use Strings for pathnames.  As per my
> earlier mail, pathnames seem to be Strings, hence
> UTF-16-encoded, so the pathnames you'll be handed are UTF-16,
> not UCS-2 (UCS-2 encodes only the BMP, with one 16-bit code
> unit per code point).
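
The UTF-16 versus UCS-2 distinction can be checked mechanically. The sketch
below uses hypothetical helper names and assumes the string is given as an
array of 16-bit code units plus a length. It counts code points the way I
understand Java's String.codePointCount() to do (an unpaired surrogate
counts as one), and reports whether the text stays inside the BMP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Counts code points in a UTF-16 sequence; a surrogate pair counts once. */
size_t utf16_code_point_count(const uint16_t *s, size_t units)
{
    size_t i = 0, count = 0;

    assert(s != NULL || units == 0);

    while (i < units) {
        if (is_high_surrogate(s[i]) && i + 1 < units &&
            is_low_surrogate(s[i + 1]))
            i += 2;     /* well-formed pair: one supplementary character */
        else
            i += 1;     /* BMP character or unpaired surrogate */
        count++;
    }
    return count;
}

/* Returns 1 if no surrogates occur, i.e. the text is also valid UCS-2. */
int fits_in_ucs2(const uint16_t *s, size_t units)
{
    size_t i;

    assert(s != NULL || units == 0);

    for (i = 0; i < units; i++) {
        if (is_high_surrogate(s[i]) || is_low_surrogate(s[i]))
            return 0;
    }
    return 1;
}
```

A pathname for which fits_in_ucs2() returns 0 is exactly the case where
treating each 16-bit unit as a whole character gives the wrong answer.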
>
> > So in summary, I think the answer is that UTF-8 is supported on
> > all/most platforms and filesystem types right now.
>
> It's supported on UN*Xes where file names happen to be
> encoded in UTF-8.  Mac OS X does that (in fact, that's all
> that's supported in HFS+, although, *on disk*, HFS+ uses, I
> think, UTF-16, but what you see in the UN*X APIs is UTF-8; the
> OS X SMB client assumes all file names are UTF-8, mapping them
> to UTF-16 over the wire and mapping stuff received from over
> the wire from UTF-16 back to UTF-8).  Other UN*Xes probably
> allow other encodings, hence my comment about mapping from
> UTF-8 to the native file name encoding.
>
> On Windows, however, it's not going to work - on Windows, I
> don't think fopen() takes UTF-8-encoded pathnames, I think it
> takes pathnames encoded in whatever the current "code page"
> is.  That means that there could be unopenable files (e.g.,
> if your current code page is an Asian DBCS code page, you
> probably won't be able to open a file named "Müller's network
> problem.pcap").
>
> You'd need pcap_wopen_offline(), or something such as that,
> to fully support Unicode pathnames.
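
Until something like pcap_wopen_offline() exists, one possible workaround
is to open the file yourself and keep the code-page-dependent fopen() out
of the picture on Windows. The sketch below is an assumption-laden
illustration, not libpcap or WinPcap code: fopen_utf8_rb is a hypothetical
name, the caller is assumed to pass a UTF-8 path, and on non-Windows
systems the file name on disk is assumed to be UTF-8 as well.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <wchar.h>
#include <windows.h>
#endif

/*
 * Hypothetical helper: opens a file for binary reading from a UTF-8 path.
 * On Windows the path is converted to UTF-16 and passed to _wfopen(), so
 * the current code page never gets involved. Elsewhere the bytes go to
 * fopen() unchanged. Returns NULL on any failure.
 */
FILE *fopen_utf8_rb(const char *utf8_path)
{
    assert(utf8_path != NULL);
#ifdef _WIN32
    {
        FILE *fp = NULL;
        wchar_t *wpath;
        /* First call: ask how many wide chars are needed, NUL included. */
        int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    utf8_path, -1, NULL, 0);
        if (n <= 0)
            return NULL;
        wpath = malloc((size_t)n * sizeof *wpath);
        if (wpath == NULL)
            return NULL;
        /* Second call: do the conversion, then open with the wide path. */
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8_path, -1, wpath, n) > 0)
            fp = _wfopen(wpath, L"rb");
        free(wpath);
        return fp;
    }
#else
    return fopen(utf8_path, "rb");
#endif
}
```

The resulting FILE * could then, in principle, be handed to
pcap_fopen_offline(), assuming the libpcap/WinPcap build in use provides it
and shares the same C runtime as the caller. I have not verified that
combination on Windows.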
>
> > The UTF-16 which is what my user is
> > using for some Chinese characters in filename will not work with
> > libpcap's pcap_open_offline(). The platform he is on is Ubuntu
>
> ...which, being a Linux distribution, and hence a UN*X,
> expects pathnames to be sequences of octets, with '/' as
> separators and '\0' as a terminator.  Handing it a UTF-16
> string isn't going to work very well.
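
A tiny sketch of why that is, using hypothetical helper names: viewed as
octets, UTF-16 text is full of 0x00 bytes, and a 16-bit code unit can also
contain a 0x2F byte that would be taken for a '/' separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Length of a path as fopen() or open() would see it: the number of
 * octets before the first 0x00.
 */
size_t octet_path_length(const unsigned char *bytes)
{
    assert(bytes != NULL);
    return strlen((const char *)bytes);
}

/* Counts octets equal to '/' (0x2F) among the first len octets. */
size_t count_slash_octets(const unsigned char *bytes, size_t len)
{
    size_t i, count = 0;

    assert(bytes != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if (bytes[i] == 0x2F)
            count++;
    }
    return count;
}
```

Given the UTF-16LE octets for "cap" (63 00 61 00 70 00 00 00),
octet_path_length() returns 1, so the kernel would see a file named "c".
A code unit such as 0x4E2F, stored little-endian as 2F 4E, contributes an
octet that looks like a path separator.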
>
> *If* the file's name is encoded with UTF-8, handing it a
> UTF-8 string should work.  If it's encoded in some other
> encoding, such as Big5:
>
> 	http://en.wikipedia.org/wiki/Big5
>
> or GB 2312:
>
> 	http://en.wikipedia.org/wiki/GB2312
>
> it probably won't work.
>
> > I'm not sure what application created the file in the first place.
> > Maybe we can discern if fopen was used to create the file using
> > UTF-16 encoding or some other system call.
>
> Ultimately, the system call used to create the file was
> either open() or creat() (and the former is a superset of the
> latter); they take octet strings in some superset-of-ASCII
> encoding (UTF-8, ISO 8859/x, Big5, GB 2312, Shift JIS, etc.),
> so that all octets in the range 0x00 through 0x7F represent
> the corresponding ASCII character, and only octets with the
> 0x80 bit set are used to encode other characters.
>
> The issue probably doesn't involve UTF-16, as that's not an
> octet-string superset-of-ASCII encoding; it probably
> involves UTF-8 vs. some other encoding of Chinese.
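
One way to get a hint about which case applies is to test whether the
octets of the file name, as stored on disk, form well-formed UTF-8. The
validator below is a sketch with a hypothetical name. It is only a
heuristic: a failure means the name is not UTF-8, but a success does not
prove that it is, since pure ASCII names pass under every ASCII-superset
encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if the len octets at s are well-formed UTF-8, 0 otherwise.
 * Rejects truncated sequences, overlong forms, surrogates and values
 * above U+10FFFF.
 */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t need, k;
        uint32_t cp, min;

        if (c < 0x80) {             /* plain ASCII octet */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;               /* stray continuation or invalid lead */
        }

        if (len - i <= need)
            return 0;               /* sequence runs past the end */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;           /* not a continuation octet */
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;

        i += need + 1;
    }
    return 1;
}
```

For instance, the UTF-8 octets E4 B8 AD (U+4E2D) pass, while the two
octets A4 A4, which I believe is how Big5 encodes the same character, are
rejected because A4 cannot start a UTF-8 sequence.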
>
> As for the other user who filed
>
> 	http://jnetpcap.com/node/456
>
> he's using Windows (as per the reference to MinGW and
> jnetpcap.dll), so his problem may ultimately be caused by the
> lack of pcap_wopen_offline().
> >
> > Cheers,
> > mark..
> > http://jnetpcap.com