Re: I love character encoding!

texec:

marco:

Tonight’s goal: Make a simple PHP class.

  • Input: a URL pointing to an HTML document.
  • Output: a UTF-8 version, regardless of what encoding it’s really in.

Sounds easy, right?

mb_detect_encoding and mb_convert_encoding plus a bit of string magic. Shouldn’t be that complex - love to see your implementation.

Shouldn’t be, but it is.

  • mb_detect_encoding doesn’t always detect properly. It works statistically, and it’s imperfect.
  • mb_convert_encoding is generally better than iconv, but iconv supports more input encodings.
  • Both mb_convert_encoding and iconv are only as good as your input-encoding detection. If you tell them that the input is e.g. GB2312, you’d better be reasonably sure that it’s not something else.
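
To make the last point concrete, here is a minimal sketch of the detect-then-convert step, not a finished class. The function name to_utf8 and the candidate list are my own assumptions for illustration, and it takes the already-fetched bytes rather than a URL. It only uses mb_check_encoding, mb_detect_encoding (in strict mode) and mb_convert_encoding.

```php
<?php
// Hypothetical helper: name, signature and candidate list are assumptions.
function to_utf8(string $bytes, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // If the bytes are already valid UTF-8, leave them alone.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Strict mode (third argument) rejects candidates for which the
    // byte sequence is invalid, instead of returning a closest match.
    $from = mb_detect_encoding($bytes, $candidates, true);
    if ($from === false) {
        throw new RuntimeException('Could not detect the input encoding');
    }

    // The conversion trusts $from completely: a wrong guess here
    // produces garbage without raising any error.
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

// "caf\xE9" is "café" in ISO-8859-1; as UTF-8 it becomes "caf\xC3\xA9".
echo bin2hex(to_utf8("caf\xE9")), "\n";
```

Note the catch: ISO-8859-1 assigns a character to every byte value, so with it in the list the detection never fails. Anything that isn’t valid UTF-8 gets treated as ISO-8859-1, whether it really is or not, which is exactly the "only as good as your detection" problem.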