Problems reading RSS feeds (æøå)

Tags:    c++

Hi

I am writing a program that reads RSS feeds. So far it works quite well, but there are some problems. I thought the problem was with æøå in general, but it turns out the program reads æøå correctly from some sites and not from others. What could the error be?

This image shows what goes wrong:
http://www.dury.dk/rod/rss.jpg

Please note that I only used Kasper's site because it was the easiest one at hand; the error also occurs on other sites ;)

Best regards,

Søren




OK, does anyone have some examples or something similar? I don't quite follow what this Unicode thing is, so I would like a bit of help with it.

Best regards,
Søren


Instead of char you should, for example, use wchar_t. Take a look at the snippets below and define _UNICODE centrally in your application.

-- snip --
TCHAR.H uses two compiler preprocessor symbols to determine how it behaves:

_UNICODE
_MBCS

If neither symbol is defined, ANSI (U.S., Europe) is assumed. If _UNICODE is defined, the code will be compiled for UNICODE; if _MBCS is defined, the code will be compiled for DBCS (MBCS). The behavior if both symbols are defined is undefined.

#ifdef _UNICODE
// UNICODE specific code
#endif

#ifdef _MBCS
// DBCS specific code
#endif

#if !defined(_UNICODE) && !defined(_MBCS)
// ANSI (single byte) specific code
#endif

// *** NON-SPECIFIC CODE ***
//
// Code not under any #ifs or #ifdefs is NOT specific
// to ANY configuration! It must work for all three!
-- snip --
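
To make the quoted mechanism concrete, here is a minimal sketch of how a single _UNICODE switch can select the character type, the literal prefix and the string function at compile time. The names tchar_sketch, TS, ts_len and greeting_length are invented for this illustration; on Windows the real TCHAR.H supplies TCHAR, _T and _tcslen for the same purpose.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the TCHAR.H mappings (TCHAR, _T, _tcslen).
// tchar_sketch, TS and ts_len are illustration-only names.
#ifdef _UNICODE
typedef wchar_t tchar_sketch;            // wide build
#define TS(x) L##x                       // turns "Hello" into L"Hello"
inline std::size_t ts_len(const tchar_sketch* s) { return std::wcslen(s); }
#else
typedef char tchar_sketch;               // single-byte (ANSI) build
#define TS(x) x
inline std::size_t ts_len(const tchar_sketch* s) { return std::strlen(s); }
#endif

// Non-specific code: compiles unchanged in either configuration.
inline std::size_t greeting_length() {
    const tchar_sketch* greeting = TS("Hello");
    return ts_len(greeting);
}
```

Calling greeting_length() gives 5 in both builds; only the underlying character type differs.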

Below is a more general description of Unicode and multibyte characters, which may be interesting for you to read.

-- snip --
Unicode: The Wide-Character Set
A wide character is a 2-byte multilingual character code. Any character in use in modern computing worldwide, including technical symbols and special publishing characters, can be represented according to the Unicode specification as a wide character. Developed and maintained by a large consortium that includes Microsoft, the Unicode standard is now widely accepted. Because every wide character is always represented in a fixed size of 16 bits, using wide characters simplifies programming with international character sets.

A wide character is of type wchar_t. A wide-character string is represented as a wchar_t[] array and is pointed to by a wchar_t* pointer. You can represent any ASCII character as a wide character by prefixing the letter L to the character. For example, L'\0' is the terminating wide (16-bit) NULL character. Similarly, you can represent any ASCII string literal as a wide-character string literal simply by prefixing the letter L to the ASCII literal (L"Hello").

Generally, wide characters take up more space in memory than multibyte characters but are faster to process. In addition, only one locale can be represented at a time in multibyte encoding, whereas all character sets in the world are represented simultaneously by the Unicode representation.

Multibyte and Wide Characters
A multibyte character is a character composed of sequences of one or more bytes. Each byte sequence represents a single character in the extended character set. Multibyte characters are used in character sets such as Kanji.

Wide characters are multilingual character codes that are always 16 bits wide. The type for character constants is char; for wide characters, the type is wchar_t. Since wide characters are always a fixed size, using wide characters simplifies programming with international character sets.

The wide-character-string literal L"hello" becomes an array of six integers of type wchar_t.

{L'h', L'e', L'l', L'l', L'o', 0}

The Unicode specification is the specification for wide characters. The run-time library routines for translating between multibyte and wide characters include mbstowcs, mbtowc, wcstombs, and wctomb.
-- snip --
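
A small sketch of the two points in the quoted text: the literal L"hello" as an array of six wchar_t, and mbstowcs for converting a multibyte string to a wide one. The helper names hello_array_size and widen are made up for this example. Two things to keep in mind: the quoted text describes the Microsoft compiler, where wchar_t is 16 bits (on many other platforms it is 32 bits), and mbstowcs interprets the bytes according to the current C locale, so only plain ASCII input is safe without setting a locale first.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <string>

// Number of wchar_t elements in L"hello", including the terminating L'\0'.
inline std::size_t hello_array_size() {
    const wchar_t hello[] = L"hello";
    return sizeof(hello) / sizeof(hello[0]);
}

// Converts a multibyte string to a wide string with mbstowcs.
// Returns an empty string if the input is not valid in the current locale.
inline std::wstring widen(const char* narrow) {
    // A multibyte string never yields more wide characters than it has bytes.
    std::wstring wide(std::strlen(narrow), L'\0');
    if (wide.empty()) return wide;
    std::size_t written = std::mbstowcs(&wide[0], narrow, wide.size());
    if (written == static_cast<std::size_t>(-1)) return std::wstring();
    wide.resize(written);  // drop the unused tail
    return wide;
}
```

For example, widen("hello") gives a std::wstring equal to L"hello".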

Hth



I think it may have something to do with the charset, but I can't say for sure.




If you use Unicode you avoid problems with æøå and other odd characters. Note that there is a whole set of separate string functions for Unicode. They typically have a "w" as a prefix or postfix in the name (for example wcslen and wprintf) to indicate that they are for wide strings, that is, Unicode strings.
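
As a rough illustration of those "w" functions, the sketch below puts the narrow and wide versions side by side. The function names same_narrow, same_wide and with_exclamation are invented for the example; strlen, strcmp, wcslen, wcscmp and std::wstring are the standard library parts. The ø is written as \u00F8 so the example does not depend on how the source file itself is saved.

```cpp
#include <cstring>
#include <cwchar>
#include <string>

// Narrow (char) strings: strlen and strcmp.
inline bool same_narrow(const char* a, const char* b) {
    return std::strlen(a) == std::strlen(b) && std::strcmp(a, b) == 0;
}

// Wide (wchar_t) strings: the "w" counterparts wcslen and wcscmp.
inline bool same_wide(const wchar_t* a, const wchar_t* b) {
    return std::wcslen(a) == std::wcslen(b) && std::wcscmp(a, b) == 0;
}

// std::wstring is the wide counterpart of std::string.
inline std::wstring with_exclamation(const std::wstring& s) {
    return s + L"!";
}
```

With wide strings, std::wcslen(L"r\u00F8d") counts three characters for "rød".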

Hth



The error lies in the charset. I would suggest, as Jess says, that you switch to Unicode.
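
One way a charset problem like this can arise, assuming (the thread does not confirm it) that some of the feeds are sent as ISO-8859-1 and others as UTF-8: the same letter æ is one byte (0xE6) in ISO-8859-1 but two bytes (0xC3 0xA6) in UTF-8, so a reader that always assumes one encoding shows garbage for the other. The sketch below decodes both to Unicode code points. The function names decode_latin1 and decode_utf8 are made up for this example, and the UTF-8 decoder is deliberately minimal.

```cpp
#include <cstddef>
#include <string>

// ISO-8859-1: every byte is one character whose code point equals the byte value.
inline std::u32string decode_latin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Minimal UTF-8 decoder: handles 1- to 4-byte sequences and replaces malformed
// input with U+FFFD. It does not reject overlong forms or surrogates.
inline std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

Decoding "\xC3\xA6" with decode_utf8 and "\xE6" with decode_latin1 both give the single code point U+00E6 (æ), while decode_latin1("\xC3\xA6") gives two characters, which is the typical garbled output. An RSS feed normally states its encoding in the XML declaration at the top of the file, so the program can pick the matching decoder from there.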


