JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8, I get Question Marks instead

Posted by Mark Bennett on Stack Overflow
Published on 2010-12-22T22:48:29Z Indexed on 2010/12/22 22:54 UTC

I read in search terms from a simple text file to send to a search engine. It works fine in English, but gives me ???? for any Japanese text. Text with mixed English and Japanese does show the English text, so I know it's reading it.

What I'm seeing:

Input text:
Snow Leopard ???????????????

Turns into:
Snow Leopard ???????????????

This is in a POST field of an HTTP request. If I set JMeter to encode the data, it just puts in the percent-encoded sequence for question marks.

Interesting note: In the example above there are 15 Japanese characters, and then 15 question marks, so at some point it's being seen as full characters and not just bytes.
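One way to read that observation, sketched in Python rather than in JMeter itself: if the file is decoded correctly as UTF-8 and the resulting characters are later encoded into a charset that has no Japanese (ISO-8859-1 is used here purely as an assumed example), each unmappable character becomes exactly one "?". The sample term below is hypothetical, not the original search string, and whether JMeter takes this path is an assumption, not something the post confirms.

```python
# Hypothetical sample term; the original Japanese text is not reproduced here.
term = "Snow Leopard 日本語"

# Step 1: the bytes a UTF-8 file would contain for this line.
utf8_bytes = term.encode("utf-8")

# Step 2: decoding as UTF-8 succeeds and yields real characters.
decoded = utf8_bytes.decode("utf-8")

# Step 3: encoding into a charset without Japanese (assumed: ISO-8859-1)
# replaces each unmappable character with a single "?".
sent = decoded.encode("iso-8859-1", errors="replace")
print(sent)  # b'Snow Leopard ???'

# For contrast: misreading the raw UTF-8 bytes as ISO-8859-1 produces
# one junk character per byte, not one "?" per character.
misread = utf8_bytes.decode("iso-8859-1")
print(len(decoded), len(misread))  # 16 22
```

If this is what is happening, the read from the CSV file is probably fine and the damage occurs when the request body is encoded, which would point at the content encoding of the HTTP request rather than at the data file.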

About the Data:

The CSV file is very simple in structure.
There's only one field / one column, which I name TERM, and later use as ${TERM}
I don't really need full CSV because it's only one string per line.
There are no commas or quotes.
When I run the Unix "file" command on the file, it says UTF-8 text.
I've also verified it in command line and graphical mode on two machines.
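As a cross-check on the "file" command, a strict decode of the whole file can confirm that every byte is valid UTF-8 and show whether a byte order mark is present. This is a minimal sketch; `check_utf8` is a hypothetical helper name, and the only thing taken from the setup above is the file name.

```python
import codecs


def check_utf8(path):
    """Return (is_valid_utf8, has_bom, first_line) for the file at path."""
    with open(path, "rb") as f:
        raw = f.read()
    # A UTF-8 byte order mark is legal but can end up inside the first value.
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        text = raw.decode("utf-8")  # strict mode: raises on any invalid byte
    except UnicodeDecodeError:
        return False, has_bom, None
    lines = text.splitlines()
    return True, has_bom, lines[0] if lines else ""


# Example call (file name from the config in this post):
# print(check_utf8("japanese-searches.csv"))
```

If this reports valid UTF-8 with no byte order mark, the file itself is unlikely to be the cause.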

JMeter CSV Dataset Config:

Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True; the result was different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads

A few things I've tried:

  • Tried Allow Quoted Data. The output changed to other strange characters.
  • Tried -Dfile.encoding=UTF-8.
  • Tried encoding the POST, but it just turned into a bunch of %nn sequences for question marks.
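To tell whether a percent-encoded POST value carries real UTF-8 bytes or literal question marks, the captured value can be decoded back to bytes and inspected. The sketch below assumes the encoded value has been copied out of the request by hand; `inspect_encoded` is a hypothetical helper and both sample strings are made up for illustration.

```python
from urllib.parse import unquote_to_bytes


def inspect_encoded(value):
    """Decode a percent-encoded value and summarize the bytes it carries."""
    raw = unquote_to_bytes(value)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = None  # the bytes are not valid UTF-8
    return {
        "bytes": raw,
        "question_marks": raw.count(b"?"),             # 0x3F, sent as %3F
        "non_ascii_bytes": sum(b > 0x7F for b in raw),
        "utf8_text": text,
    }


# Literal question marks: the characters were lost before encoding.
print(inspect_encoded("%3F%3F%3F"))
# Intact UTF-8: three bytes per character for this sample.
print(inspect_encoded("%E6%97%A5%E6%9C%AC%E8%AA%9E"))
```

A value made up only of %3F means the text had already been replaced with "?" before percent-encoding happened, so changing the URL-encoding setting alone would not help.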

And I'm not sure how to debug the value just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.

If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.

© Stack Overflow or respective owner
