encoding issue with gb2312 data #87

Closed
axtens opened this issue May 21, 2013 · 7 comments
@axtens

axtens commented May 21, 2013

Context: Microsoft JScript on Windows Server 2008 R2 64bit

var url = "http://www.google.com.hk/search?q=pennytel downloads&sa=%CB%D1%CB%F7&forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
var x = new URI(url);
var rMap = x.search(true);

When the .search is executed I get

Microsoft JScript runtime error: The URI to be decoded is not a valid encoding

The break occurs here

d.decodeQuery = function (a) {
    return d.decode((a + "").replace(/\+/g, " "))
};

and it's probably complaining about the "sa=%CB%D1%CB%F7". What's amiss here? Is it fixable? Is it an encoding issue or something else?

@rodneyrehm
Member

I can reproduce the issue in Firefox 21 on Mac. This sequence is the problem: %CB%D1. It can't be decoded by decodeURIComponent().

decodeURIComponent() expects UTF-8 escape sequences and fails if it can't resolve the input. Using unescape(), the sequence resolves to ËÑ, which would properly be percent-encoded as %C3%8B%C3%91.

Can you check what characters this sequence should resolve to? Can you make sure that the data is UTF-8?
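
The difference between the two decoders can be checked directly. This is a minimal sketch, assuming the problematic sequence is %CB%D1 (the two bytes that unescape() maps to ËÑ); `tryDecode` is a hypothetical helper name and not part of URI.js.

```javascript
// Hypothetical helper: run a decoder and report either its result or the error name.
function tryDecode(decoder, input) {
  try {
    return { ok: true, value: decoder(input) };
  } catch (e) {
    return { ok: false, error: e.name };
  }
}

var sequence = "%CB%D1"; // assumed raw bytes 0xCB 0xD1, which are not valid UTF-8

// decodeURIComponent() reads the escapes as UTF-8 and throws a URIError here.
var strict = tryDecode(decodeURIComponent, sequence);

// unescape() maps each %XX to the code point U+00XX, so it never throws on this input.
var loose = tryDecode(unescape, sequence);

// Re-encoding the unescape() result as UTF-8 gives a sequence that decodes cleanly.
var reencoded = encodeURIComponent(loose.value);
```

Here `strict.ok` is false with `strict.error` set to "URIError", `loose.value` is "ËÑ", and `reencoded` is "%C3%8B%C3%91".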

@axtens
Author

axtens commented May 22, 2013

As far as I can tell, given the ie and oe variables (&ie=GB2312&oe=GB2312), the characters are GB2312-encoded Chinese characters. If I store ËÑË÷ in a text file and, using BabelPad, read them in as GB2312, I get 脣脩脣梅. That expressed as UTF-8 is, in hex, E8 84 A3 E8 84 A9 E8 84 A3 E6 A2 85.

Now, how to deal with this is tricky because the original URL has come into our website via Google Hong Kong, so we have no way of controlling how the data is encoded. Do I change URI.js to use unescape? At the moment, I run every URL through unescape() anyway so that URI.js doesn't crash on the weird ones.
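
One way to inspect what such a parameter holds is to percent-decode it to raw bytes first and only then choose a character set. The sketch below assumes the parameter value is %CB%D1%CB%F7 and that the runtime provides a TextDecoder that accepts the gb2312 label (modern browsers, or Node.js built with full ICU). JScript has neither, so this is for diagnosis only. `percentToBytes` and `decodeAs` are hypothetical helper names.

```javascript
// Hypothetical helper: turn a percent-encoded string into an array of byte values
// without assuming any character set. Assumes unescaped characters are ASCII.
function percentToBytes(input) {
  var bytes = [];
  for (var i = 0; i < input.length; i++) {
    var hex = input.substr(i + 1, 2);
    if (input.charAt(i) === "%" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(input.charCodeAt(i) & 0xff);
    }
  }
  return bytes;
}

// Hypothetical helper: decode bytes with a named charset. Returns null when the
// runtime has no TextDecoder, does not know the label, or the bytes are invalid.
function decodeAs(label, bytes) {
  if (typeof TextDecoder === "undefined") { return null; }
  try {
    return new TextDecoder(label, { fatal: true }).decode(new Uint8Array(bytes));
  } catch (e) {
    return null;
  }
}

var bytes = percentToBytes("%CB%D1%CB%F7"); // [0xCB, 0xD1, 0xCB, 0xF7]
var asGb2312 = decodeAs("gb2312", bytes);   // a string, or null if the label is unsupported
var asUtf8 = decodeAs("utf-8", bytes);      // null: these bytes are not valid UTF-8
```

Working from the raw bytes rather than from the characters ËÑË÷ saved in a text file avoids a second layer of mis-decoding. Where the gb2312 label is supported, these four bytes should decode to 搜索 ("search").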

@rodneyrehm
Member

Well, URI.js supports UTF-8 and ISO 8859 modes. You could easily wrap things:

URI.prototype.getQueryParameters = function() {
  var uri = URI(this.search());
  try {
    return uri.search(true);
  } catch(e) {
    return uri.unicode().search(true);
  }
};

yielding: URI("?a=%CB%D1").getQueryParameters() == { a: "ËÑ" }

I'm not sure if I'd want this to happen automatically, internally, without the implementor even noticing…

@rodneyrehm
Member

See #92 as well

@axtens
Author

axtens commented Jun 28, 2013

This issue's popped up again and I'm trying to figure out how to get around it.
The URL in this case is
var url = "http://www.google.com.hk/search?q=pennytel downloads&sa=%CB%D1%CB%F7&forid=1&prog=aff&ie=GB2312&oe=GB2312&safe=active&source=sdo_sb_html&hl=zh-CN";
and the code which is breaking (with the same error and error-location as above)

var uri = new URI(url); 
//...
var uQuery = uri.clone().setQuery("");

It's the setQuery that's failing. How do I set my query to nothing without using setQuery()?

@axtens
Author

axtens commented Jun 28, 2013

Ok, simple answer: var uQuery = uri.clone().search("");

@rodneyrehm
Member

I've fixed this in master. It will be included in the next release. Thank you for your help!

QueryString data that cannot be decoded will now simply be returned undecoded - that way any decodable data can still be of use.
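
The behaviour described above can be sketched in plain JavaScript. This is only an illustration of the idea, not the code that went into URI.js; `decodeQuerySafe` and `parseQuerySafe` are hypothetical names, and the sample query assumes the sa parameter is %CB%D1%CB%F7.

```javascript
// Hypothetical helper: decode one query component. If the escapes are not valid
// UTF-8, give up and return the original, undecoded component.
function decodeQuerySafe(component) {
  var original = String(component);
  try {
    return decodeURIComponent(original.replace(/\+/g, " "));
  } catch (e) {
    return original;
  }
}

// Hypothetical helper: parse "?a=1&b=2" into an object, tolerating bad escapes.
// Repeated names overwrite earlier ones in this sketch.
function parseQuerySafe(query) {
  var result = {};
  var pairs = String(query).replace(/^\?/, "").split("&");
  for (var i = 0; i < pairs.length; i++) {
    if (pairs[i] === "") { continue; }
    var eq = pairs[i].indexOf("=");
    var name = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
    var value = eq === -1 ? null : pairs[i].slice(eq + 1);
    result[decodeQuerySafe(name)] = value === null ? null : decodeQuerySafe(value);
  }
  return result;
}

var params = parseQuerySafe("?q=pennytel+downloads&sa=%CB%D1%CB%F7&ie=GB2312");
// params.q  is "pennytel downloads"
// params.sa is "%CB%D1%CB%F7" (left undecoded)
// params.ie is "GB2312"
```

The undecodable parameter stays available in its raw form, so a caller that knows the real character set can still decode it separately.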
