[Winpcap-users] pcap_open_offline and unicode charsets

Guy Harris guy at alum.mit.edu
Sun Nov 8 14:28:28 PST 2009


On Nov 8, 2009, at 12:55 PM, Mark Bednarczyk wrote:

>>> My library gets its filename from a java string and it currently
>>> converts it to plain UTF-8 charset and that works fine.
>>
>> On UN*X, it should perhaps be converted to whatever the
>> locale's filename character set is.
>
> But I don't actually call fopen directly; I rely on libpcap to work
> with the filesystem. Therefore I would like to go by the specs that
> libpcap provides for the pcap_open_offline call. It would be nice to
> somehow handle and provide a definitive specification when passing
> in a string.

The definitive specification is "it calls fopen(), so it does the same  
thing as fopen()".

*If* a file name happens to be encoded, in the file system, using  
UTF-8, you would hand that UTF-8 string to fopen() to open it, so you  
would do the same with pcap_open_offline().  If, instead, it happens  
to be encoded using ISO 8859/1, or 8859/2, or 8859/15, or..., or  
KOI-8, or Shift-JIS, or EUC-JP, or..., you'd hand a string in *that*  
encoding.  (Sorry, but UN*X internationalization antedated Unicode, so  
they had to do *something*, and ended up doing a variety of different  
things in different locales.  Oh, and don't get me started about  
Unicode normalization forms....)
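To make that concrete, here's a small sketch in Java (the language of
jNetPcap; the class name and the sample file name are just for
illustration) showing that the same name is a different octet sequence
under different encodings:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FileNameBytes {
    public static void main(String[] args) {
        String name = "Müller.pcap";

        // The same name produces different octet sequences depending on
        // the encoding the file system (or locale) happens to use.
        byte[] utf8   = name.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);

        // UTF-8 encodes U+00FC ('ü') as two bytes (0xC3 0xBC);
        // ISO 8859-1 encodes it as the single byte 0xFC.
        System.out.println("UTF-8 length:      " + utf8.length);   // 12
        System.out.println("ISO 8859-1 length: " + latin1.length); // 11
        System.out.println("Same bytes? " + Arrays.equals(utf8, latin1)); // false
    }
}
```

A file created under one of those encodings simply won't be found if
you hand fopen() the bytes from the other one.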

>
>>
>> I'm not sure how that would be determined, however.  I might
>> be tempted to assume that, if the environment variable
>> LC_CTYPE is set that specifies the encoding, otherwise if
>> LANG is set that specifies the encoding, otherwise it might
>> be the C locale (which, I think, unfortunately says the
>> encoding is ASCII).  However, GLib (not glibc,
>> GLib) has its own additional environment variables:
>>
>> 	http://library.gnome.org/devel/glib/stable/glib-running.html
>>
>> and I'm not sure why that's the case.
>>
>>> But in reality I'd like to support all unicode widths 8, 16 and even
>>> 32 bit. I'm not sure how those wider unicode chars would be handled.
>>
>> How are they handled elsewhere in Java?  The File class seems
>> to work with Strings, and the String class, at least as I
>> understand the documentation, uses UTF-16 (presumably that's
>> what you mean by "unicode [width] ... 16 ... bit").
>
> Java has extensive Unicode support, even for the extended Unicode
> range where two UTF-16 chars combine to describe a single character.

If that's "surrogate pairs", that's more like "combining two 16-bit  
codes" - a surrogate pair is a single character, represented as two  
"code units":

	http://unicode.org/standard/principles.html

"Encoding Forms

Character encoding standards define not only the identity of each  
character and its numeric value, or code point, but also how this  
value is represented in bits.

The Unicode Standard defines three encoding forms that allow the same  
data to be transmitted in a byte, word or double word oriented format  
(i.e. in 8, 16 or 32-bits per code unit). All three encoding forms  
encode the same common character repertoire and can be efficiently  
transformed into one another without loss of data. The Unicode  
Consortium fully endorses the use of any of these encoding forms as a  
conformant way of implementing the Unicode Standard.

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of  
transforming all Unicode characters into a variable length encoding of  
bytes. It has the advantages that the Unicode characters corresponding  
to the familiar ASCII set have the same byte values as ASCII, and that  
Unicode characters transformed into UTF-8 can be used with much  
existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient  
access to characters with economical use of storage. It is reasonably  
compact and all the heavily used characters fit into a single 16-bit  
code unit, while all other characters are accessible via pairs of  
16-bit code units.

UTF-32 is popular where memory space is no concern, but fixed width,  
single code unit access to characters is desired. Each Unicode  
character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for  
each character."

At least as I read the description of the String class:

	http://java.sun.com/javase/6/docs/api/java/lang/String.html

it's based on UTF-16:

"A String represents a string in the UTF-16 format in which  
supplementary characters are represented by surrogate pairs (see the  
section Unicode Character Representations in the Character class for  
more information). Index values refer to char code units, so a  
supplementary character uses two positions in a String.

The String class provides methods for dealing with Unicode code points  
(i.e., characters), in addition to those for dealing with Unicode code  
units (i.e., char values)."
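A quick Java illustration of that code-unit/code-point distinction
(the class name and the particular character are arbitrary):

```java
public class CodeUnitsVsCodePoints {
    public static void main(String[] args) {
        // U+20000, a CJK ideograph outside the BMP, written as a
        // surrogate pair: high surrogate U+D840 + low surrogate U+DC00.
        String s = "\uD840\uDC00";

        // Two 16-bit code units...
        System.out.println(s.length());                      // 2
        // ...but a single code point, i.e. one character.
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 20000
    }
}
```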

> Here is how java represents unicode characters:
>
> The char data type (and therefore the value that a Character object
> encapsulates) are based on the original Unicode specification, which  
> defined
> characters as fixed-width 16-bit entities.

Meaning it can't handle characters outside the BMP.

However, from your example in "Decoding packets manually":

	String file = "capturefile.pcap";

		...

	Pcap pcap = Pcap.openOffline(file, errbuf);

it appears that you use Strings for pathnames.  As per my earlier  
mail, pathnames seem to be Strings, hence UTF-16-encoded, so the  
pathnames you'll be handed are UTF-16, not UCS-2 (UCS-2 encodes only  
the BMP, with one 16-bit code unit per code point).

> So in summary, I think the answer is that UTF-8 is supported on
> all/most platforms and filesystem types right now.

It's supported on UN*Xes where file names happen to be encoded in  
UTF-8.  Mac OS X does that (in fact, UTF-8 is all that's supported in  
HFS+, although *on disk* HFS+ uses, I think, UTF-16 - what you see in  
the UN*X APIs is UTF-8; the OS X SMB client likewise assumes all file  
names are UTF-8, mapping them to UTF-16 over the wire and mapping  
names received over the wire from UTF-16 back to UTF-8).  Other  
UN*Xes probably allow other encodings, hence my comment about mapping  
from UTF-8 to the native file name encoding.

On Windows, however, it's not going to work - on Windows, I don't  
think fopen() takes UTF-8-encoded pathnames, I think it takes  
pathnames encoded in whatever the current "code page" is.  That means  
that there could be unopenable files (e.g., if your current code page  
is an Asian DBCS code page, you probably won't be able to open a file  
named "Müller's network problem.pcap").
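Here's a sketch of that lossiness in Java, using US-ASCII as a
stand-in for a code page that lacks 'ü' (an Asian DBCS code page
would behave the same way for this character; the class name is just
for illustration):

```java
import java.nio.charset.StandardCharsets;

public class UnmappableName {
    public static void main(String[] args) {
        String name = "Müller's network problem.pcap";

        // Encode into a charset that cannot represent 'ü'.  Java's
        // String.getBytes(Charset) substitutes the charset's default
        // replacement ('?') for unmappable characters.
        byte[] encoded = name.getBytes(StandardCharsets.US_ASCII);

        // The octet string no longer names the same file - round-
        // tripping shows the loss.
        String roundTripped = new String(encoded, StandardCharsets.US_ASCII);
        System.out.println(roundTripped);              // M?ller's network problem.pcap
        System.out.println(name.equals(roundTripped)); // false
    }
}
```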

You'd need pcap_wopen_offline(), or something such as that, to fully  
support Unicode pathnames.

> The UTF-16 which my user is using for some Chinese characters in a
> filename will not work with libpcap's pcap_open_offline(). The
> platform he is on is Ubuntu.

...which, being a Linux distribution, and hence a UN*X, expects  
pathnames to be sequences of octets, with '/' as separators and '\0'  
as a terminator.  Handing it a UTF-16 string isn't going to work very  
well.
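For instance, in Java (class and path names are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf16VsCString {
    public static void main(String[] args) {
        String path = "capture.pcap";

        // UTF-16BE puts a 0x00 byte before every ASCII character, and
        // a UN*X pathname is a NUL-terminated octet string - fopen()
        // would see an empty (or truncated) name.
        byte[] utf16 = path.getBytes(StandardCharsets.UTF_16BE);
        System.out.println("first UTF-16BE byte: " + utf16[0]); // 0

        // UTF-8, by contrast, is NUL-free for any real file name and
        // matches ASCII byte-for-byte here.
        byte[] utf8 = path.getBytes(StandardCharsets.UTF_8);
        System.out.println("first UTF-8 byte: " + utf8[0]); // 99 ('c')
    }
}
```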

*If* the file's name is encoded with UTF-8, handing it a UTF-8 string  
should work.  If it's encoded in some other encoding, such as Big5:

	http://en.wikipedia.org/wiki/Big5

or GB 2312:

	http://en.wikipedia.org/wiki/GB2312

it probably won't work.
> I'm not sure what application created the file in the first place.
> Maybe we can discern whether fopen was used to create the file using
> UTF-16 encoding or some other system call.

Ultimately, the system call used to create the file was either open()  
or creat() (and the former is a superset of the latter); they take  
octet strings in some superset-of-ASCII encoding (UTF-8, ISO 8859/x,  
Big5, GB 2312, Shift JIS, etc.), so that all octets in the range 0x00  
through 0x7F represent the corresponding ASCII character, and only  
octets with the 0x80 bit set are used to encode other characters.
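That ASCII transparency is easy to demonstrate in Java (the names
here are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSuperset {
    public static void main(String[] args) {
        // For a pure-ASCII name, every superset-of-ASCII encoding
        // produces the identical octet string.
        String ascii = "trace01.pcap";
        System.out.println(Arrays.equals(
                ascii.getBytes(StandardCharsets.UTF_8),
                ascii.getBytes(StandardCharsets.US_ASCII))); // true

        // Non-ASCII characters are encoded only with octets that have
        // the 0x80 bit set, so they can never collide with ASCII.
        for (byte b : "ü".getBytes(StandardCharsets.UTF_8)) {
            System.out.println((b & 0x80) != 0); // true (both bytes)
        }
    }
}
```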

The issue probably doesn't involve UTF-16, as that's not an  
octet-string, superset-of-ASCII encoding; it probably involves UTF-8  
vs. some other encoding of Chinese.

As for the other user who filed

	http://jnetpcap.com/node/456

he's using Windows (as per the reference to MinGW and jnetpcap.dll),  
so his problem may ultimately be caused by the lack of  
pcap_wopen_offline().
>
> Cheers,
> mark..
> http://jnetpcap.com



More information about the Winpcap-users mailing list