Search Results

Search found 5325 results on 213 pages for 'huffman encoding'.

Page 18/213 | < Previous Page | 14 15 16 17 18 19 20 21 22 23 24 25  | Next Page >

  • Time of splitting and encoding video with ffmpeg increases exponentially

    - by Rnd_d
    I'm trying to split video by subtitles and encode into .mp4(h.264/aac) using ffmpeg, but it takes so much time! First pieces of video are splited and encoded really fast, but for each iteration time increases, and so 40 min video takes all night or more. Small 3 min video takes max 10 mins. cmd for splitting and encoding: ffmpeg -i filename.avi -ss 00:00:0(time of sub start) -t 0:0:3(time of sub duration) -acodec libfaac -vcodec libx264 -bf 0 -f mp4 filename.mp4 ffmpeg version N-34849-g07c7ffc (last, i think) How I can make it faster? Are there, maybe, some magic arguments for ffmpeg or some hacks?

    Read the article

  • Looking for more details about "Group varint encoding/decoding" presented in Jeff's slides

    - by Mickey Shine
    I noticed that in Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, a method of integers compression called "group varint encoding" was mentioned. It was said much faster than 7 bits per byte integer encoding (2X more). I am very interested in this and looking for an implementation of this, or any more details that could help me implement this by myself. I am not a pro and new to this, and any help is welcome!

    Read the article

  • stdout and stderr character encoding

    - by Muhammad alaa
    i working on a c++ string library that have main 4 classes that deals with ASCII, UTF8, UTF16, UTF32 strings, every class has Print function that format an input string and print the result to stdout or stderr. my problem is i don't know what is the default character encoding for those streams. for now my classes work in windows, later i'll add support for mac and linux so if you know anything about those stream encoding i'll appreciate it. thank you.

    Read the article

  • Crystal reports - Encoding Issue UTF-8 / iso-8559-1

    - by Lloyd
    Hi, Has anyone had this error before when running reports in Crystal Reports 2008: Character set encoding from transport information [UTF-8] does not match with character set encoding in the received SOAP message [iso-8559]. I presume this indicates a file needs changing, but trying to fint it is causing me a real headache. There are references to UTF-8 all over the server. Any help would be great. Thanks, Lloyd.

    Read the article

  • Encoding error PostgreSQL 8.4

    - by KandadaBoggu
    I am importing data from a CSV file. One of the fields has an accent(Telefónica O2 UK Limited). The application throws en error while inserting the data to the table. PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xf36e6963 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". : INSERT INTO "companies" ("name", "validated") VALUES(E'Telef?nica O2 UK Limited', 't') How do I workaround this issue?

    Read the article

  • Default encoding type for wsHttp binding

    - by user102533
    My understanding was that the default encoding for wsHttp binding is text. However, when I use Fiddler to see the SOAP message, a part of it looks like this: <s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:a="http://www.w3.org/2005/08/addressing" xmlns:u="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd"><s:Header><a:Action s:mustUnderstand="1" u:Id="_2">http://tempuri.org/Services/MyContract/GetDataResponse</a:Action><a:RelatesTo u:Id="_3">urn:uuid:503c5525-f585-4ecd-ac09-24db78526952</a:RelatesTo><o:Security s:mustUnderstand="1" xmlns:o="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd"><u:Timestamp u:Id="uuid-8935f789-fbb7-4c69-9f67-7708373088c5-22"><u:Created>2010-03-08T19:15:50.852Z</u:Created><u:Expires>2010-03-08T19:20:50.852Z</u:Expires></u:Timestamp><c:DerivedKeyToken u:Id="uuid-8935f789-fbb7-4c69-9f67-7708373088c5-18" xmlns:c="http://schemas.xmlsoap.org/ws/2005/02/sc"><o:SecurityTokenReference><o:Reference URI="urn:uuid:b2cbfe07-8093-4f44-8a06-f8b062291643" ValueType="http://schemas.xmlsoap.org/ws/2005/02/sc/sct"/></o:SecurityTokenReference><c:Offset>0</c:Offset><c:Length>24</c:Length><c:Nonce>afOoDygRG7BW+q8+makVIA==</c:Nonce></c:DerivedKeyToken><c:DerivedKeyToken u:Id="uuid-8935f789-fbb7-4c69-9f67-7708373088c5-19" xmlns:c="http://schemas.xmlsoap.org/ws/2005/02/sc"><o:SecurityTokenReference><o:Reference URI="urn:uuid:b2cbfe07-8093-4f44-8a06-f8b062291643" ValueType="http://schemas.xmlsoap.org/ws/2005/02/sc/sct"/></o:SecurityTokenReference><c:Nonce>l4rFsdYKLJTK4tgUWrSBRw==</c:Nonce></c:DerivedKeyToken><e:ReferenceList xmlns:e="http://www.w3.org/2001/04/xmlenc#"><e:DataReference URI="#_1"/><e:DataReference URI="#_4"/></e:ReferenceList><e:EncryptedData Id="_4" Type="http://www.w3.org/2001/04/xmlenc#Element" xmlns:e="http://www.w3.org/2001/04/xmlenc#"><e:EncryptionMethod Algorithm="http://www.w3.org/2001/04/xmlenc#aes256-cbc"/><KeyInfo xmlns="http://www.w3.org/2000/09/xmldsig#"><o:SecurityTokenReference><o:Reference ValueType="http://schemas.xmlsoap.org/ws/2005/02/sc/dk" URI="#uuid-8935f789-fbb7-4c69-9f67-7708373088c5-19"/></o:SecurityTokenReference></KeyInfo><e:CipherData><e:CipherValue>dW8B7wGV9tYaxM5ADzY6UuEgB5TFzdy4BZjOtF0NEbHyNevCIAVHMoyA69U4oUjQHMJD5nHS0N4tnJqfJkYellKlpFZcwqruJ1J/TFx9uwLFFAwZ+dSfkDqgKu/1MFzVSY8eyeYKmbPbVEYOHr0lhw3+7wn5NQr3yxvCjlucTAdklIhD72YnVlSVapOW3zgysGt5hStyj+bmIz5hLGyyv6If4HzWjUiru8V3iMM/ss1I+i9sJOD013kr4zaaA937CN9+/aZ2wbDXnYj31UX49uE/vvt9Tl+c4SiydbiX7tp1eNSTx9Ms5O64gb3aUmHEAYOJ19XCrr756ssFZtaE7QOAoPQkFbx9zXy0mb9j1YoPQNG+JAcrN0yoRN1klhccmY+csfYXdq7YBB/KS+u2WnUjQ7SlNFy5qIPxuy5y0Jyedr2diPKLi0gUi+cK49BLQtG/XEShtxFaeMy7zZTrQADxww7kEkhvtmAlmyRbz3oGc+ This doesn't look like text encoding to me (Shouldn't text encoding send data in readable form)? What am I missing? Also, how do I setup binary encoding for wsHttp binding?

    Read the article

  • PHP PowerPoint with kohana 3 got encoding errors

    - by Jabeen
    I have just downloaded the fresh copy of phppowerpoint and added the following line of code 'phppowerpoint' = MODPATH.'phppowerpoint', and now my page got blank and I got the following error: The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must to be declared in the document or in the transfer protocol. Any thoughts? Thanks.

    Read the article

  • zend studio encoding problem

    - by Syom
    i used Zend Studio for eclipse, and now i have a problem with encoding. when i try to write text in foreign language, it doesn't understand the font. but in browser it shows normally. i've set then default encoding to UTF-8, but it doesn't work. could you help me. Thanks

    Read the article

  • SQLite MYSQL character encoding

    - by Lee Armstrong
    I have a strange situation where the following code works however XCode warns it is deprecated... NSString *col1 = [NSString stringWithCString:(char *)sqlite3_column_text(compiledStatement, 0)]; However as that is the deprecated method if I set an encoding the string comes out wrong! I have tried all the encodings but none work! NSString *col1 = [NSString stringWithCString:(char *)sqlite3_column_text(compiledStatement, 0) encoding:NSASCIIStringEncoding];

    Read the article

  • How ensure if java program uses UTF-8 encoding

    - by Nayn
    Hi, I recently discovered that relying on default encoding of JVM causes bugs. I should explicitly use specific encoding ex. UTF-8 while working with String, InputStynreams etc. I have a huge codebase to scan for ensuring this. Could somebody suggest me some simpler way to check this than searching the whole codebase. Thanks Nayn

    Read the article

  • Encode real-time dvb-s stream using mencoder

    - by karatchov
    My satellite receiver can stream the mpeg-2 video/audio output through lan. Using mencoder, I'm trying to build a script to encode and save the stream in real time with my Core2Duo 1.8 Ghz. Right now, I'm using a single pass, it produces good quality for a video rate of 800Kb/s, but takes more then 95% of CPU power, thus making a lot of frameskips is the computer is used while encoding. mencoder -o -vf lavcdeint -oac mp3lame -lameopts abr:q=2:aq=2 -ovc x264 -ffourcc avc1 -x264encopts crf=25:me=hex:subq=9:frameref=2:nocabac:threads=auto -mc 3 So, I'm considering using a 2-pass encoding to alleviate the processor and record 100% of the stream. But I have no idea how to start. For the info: Standard Stream: mpeg-2 720*576 25fps HD Stream: 1920*1080 50fps (this is not my goal to record it, but it will be super cool if I could)

    Read the article

  • How to prevent stretching, blurring and pixelating of embedded logos in VirtualDub?

    - by NoCanDo
    Howdy, take a look at this please http://www.youtube.com/watch?v=te-HVN8y_QE&hd=1 . Notice the embedded "logo" in the upper left corner? How blurry and pixelated it is? This is the original image: http://i.imagehost.org/0148/movie_watermark.png ! The stretching, blurring, pixelating etc. most likely comes from resizing the original video from 1920x1200 to 1280x720 and encoding it with h.264. Can anyone tell me how I can prevent the blurring, unsharpening and pixelating and retain their original quality? How do I exclude the logo from the whole encoding process and just slap it there in its original format and form?

    Read the article

  • Postfix character encoding?

    - by Anonymous12345
    I use Postfix as a mailserver. I have Ubuntu OS. Then I use PHP to send emails. Problem is that none of my emails are encoded properly by a mailsoftware which my VPS provider uses. According to them, the problem lies with me. It is only the name field which isn't encoded properly. For example "Björn" becomes "Björn" in my emails. However, when I echo the $name, it outputs "Björn" which is correct. Also, gmail and hotmail does show it correctly. The strange part is that the "text" (the message itself) is encoded properly. I use the following for sending mail: $headers="MIME-Version: 1.0"."\n"; $headers.="Content-type: text/plain; charset=UTF-8"."\n"; $headers.="From: $name <$email>"."\n"; $name= iconv(mb_detect_encoding($name), "UTF-8//IGNORE//TRANSLIT", $name); //// I HAVE TRIED WITH AND WITHOUT THE LINE ABOVE, NO DIFFERENCE mail($to, '=?UTF-8?B?'.base64_encode($subject).'?=', $text, $headers, '[email protected]'); I have tried with and without the iconv line also, no luck. The last thing I can think of is POSTFIX, could there be a setting for character encoding there? Anybody knows?

    Read the article

  • Postfix character encoding?

    - by Camran
    I use Postfix as a mailserver. I have Ubuntu OS. Then I use PHP to send emails. Problem is that none of my emails are encoded properly by a mailsoftware which my VPS provider uses. According to them, the problem lies with me. It is only the name field which isn't encoded properly. For example "Björn" becomes "Björn" in my emails. However, when I echo the $name, it outputs "Björn" which is correct. Also, gmail and hotmail does show it correctly. The strange part is that the "text" (the message itself) is encoded properly. I use the following for sending mail: $headers="MIME-Version: 1.0"."\n"; $headers.="Content-type: text/plain; charset=UTF-8"."\n"; $headers.="From: $name <$email>"."\n"; $name= iconv(mb_detect_encoding($name), "UTF-8//IGNORE//TRANSLIT", $name); //// I HAVE TRIED WITH AND WITHOUT THE LINE ABOVE, NO DIFFERENCE mail($to, '=?UTF-8?B?'.base64_encode($subject).'?=', $text, $headers, '[email protected]'); I have tried with and without the iconv line also, no luck. The last thing I can think of is POSTFIX, could there be a setting for character encoding there? Anybody knows?

    Read the article

  • x86 instruction encoding tables

    - by Cheery
    I'm in middle of rewriting my assembler. While at it I'm curious about implementing disassembly as well. I want to make it simple and compact, and there's concepts I can exploit while doing so. It is possible to determine rest of the x86 instruction encoding from opcode (maybe prefix bytes are required too, a bit). I know many people have written tables for doing it. I'm not interested about mnemonics but instruction encoding, because it is an actual hard problem there. For each opcode number I need to know: does this instruction contain modrm? how many immediate fields does this instruction have? what encoding does an immediate use? is the immediate in field an instruction pointer -relative address? what kind of registers does the modrm use for operand and register fields? sandpile.org has somewhat quite much what I'd need, but it's in format that isn't easy to parse. Before I start writing and validating those tables myself, I decided to write this question. Do you know about this kind of tables existing somewhere? In a form that doesn't require too much effort to parse.

    Read the article

  • Using both chunked transfer encoding and gzip

    - by RadiantHeart
    I recently started using gzip on my site and it worked like charm on all browsers except Opera which gives an error saying it could not decompress the content due to damaged data. From what I can gather from testing and googling it might be a problem with using both gzip and chunked transfer encoding. The fact that there is no error when requesting small files like css-files also points in that direction. Is this a known issue or is there something else that I havent thought about? Someone also mentioned that it could have something to do with sending a Content-Length header. Here is a simplified version of the most relevant part of my code: $contents = ob_get_contents(); ob_end_clean(); header('Content-Encoding: '.$encoding); print("\x1f\x8b\x08\x00\x00\x00\x00\x00"); $size = strlen($contents); $contents = gzcompress($contents, 9); $contents = substr($contents, 0, $size); print($contents); exit();

    Read the article

  • Need help with PHP URL encoding/decoding

    - by Kenan
    On one page I'm "masking"/encoding URL which is passed to another page, there I decode URL and start file delivering to user. I found some function for encoding/decoding URL's, but sometime encoded URL contains "+" or "/" and decoded link is broken. I must use "folder structure" for link, can not use QueryString! Here is encoding function: $urll = 'SomeUrl.zip'; $key = '123'; $result = ''; for($i=0; $i<strlen($urll); $i++) { $char = substr($urll, $i, 1); $keychar = substr($key, ($i % strlen($key))-1, 1); $char = chr(ord($char)+ord($keychar)); $result.=$char; } $result = urlencode(base64_encode($result)); echo '<a href="/user/download/'.$result.'/">PC</a>'; Here is decoding: $urll = 'segment_3'; //Don't worry for this one its CMS retrieving 3rd "folder" $key = '123'; $resultt = ''; $string = ''; $string = base64_decode(urldecode($urll)); for($i=0; $i<strlen($string); $i++) { $char = substr($string, $i, 1); $keychar = substr($key, ($i % strlen($key))-1, 1); $char = chr(ord($char)-ord($keychar)); $resultt.=$char; } echo '<br />DEC: '. $resultt; So how to encode and decode url. Thanks

    Read the article

  • Encoding multiple video streams with a single avconv invocation

    - by automatthias
    I played with avconv on Ubuntu and I'm now able to e.g. record the desktop with sound from a soundcard. One thing I wanted to do was recording two video inputs at the same time, for instance the desktop and from the webcam. I thought about doing something like this: avconv \ -f alsa \ -i default \ -acodec flac \ -f video4linux2 \ -r 6 \ -i /dev/video0 \ -f x11grab \ -i :0.0 \ out.mkv My thinking was that if you define multiple video inputs, and the .mkv format can handle multiple video streams, avconv will encode 2 video streams and 1 audio stream into one file. But this isn't what happens: avconv version 0.8.4-6:0.8.4-0ubuntu0.12.10.1, Copyright (c) 2000-2012 the Libav developers built on Nov 6 2012 16:51:11 with gcc 4.7.2 [alsa @ 0x1091bc0] capture with some ALSA plugins, especially dsnoop, may hang. [alsa @ 0x1091bc0] Estimating duration from bitrate, this may be inaccurate Input #0, alsa, from 'default': Duration: N/A, start: 1354364317.020350, bitrate: N/A Stream #0.0: Audio: pcm_s16le, 48000 Hz, 2 channels, s16, 1536 kb/s [video4linux2 @ 0x10923e0] Estimating duration from bitrate, this may be inaccurate Input #1, video4linux2, from '/dev/video0': Duration: N/A, start: 100607.724745, bitrate: 29491 kb/s Stream #1.0: Video: rawvideo, yuyv422, 640x480, 29491 kb/s, 6 tbr, 1000k tbn, 6 tbc [x11grab @ 0x107b2a0] device: :0.0+83,87 -> display: :0.0 x: 83 y: 87 width: 854 height: 480 [x11grab @ 0x107b2a0] shared memory extension found [x11grab @ 0x107b2a0] Estimating duration from bitrate, this may be inaccurate Input #2, x11grab, from ':0.0+83,87': Duration: N/A, start: 1354364318.488382, bitrate: 196761 kb/s Stream #2.0: Video: rawvideo, bgra, 854x480, 196761 kb/s, 15 tbr, 1000k tbn, 15 tbc Incompatible pixel format 'bgra' for codec 'mpeg4', auto-selecting format 'yuv420p' [buffer @ 0x107fcc0] w:854 h:480 pixfmt:bgra [avsink @ 0x10bdf00] auto-inserting filter 'auto-inserted scaler 0' between the filter 'src' and the filter 'out' [scale @ 0x10dc680] w:854 h:480 fmt:bgra -> w:854 h:480 fmt:yuv420p flags:0x4 Output #0, matroska, to '.../out.mkv': Metadata: encoder : Lavf53.21.0 Stream #0.0: Video: mpeg4, yuv420p, 854x480, q=2-31, 4000 kb/s, 1k tbn, 15 tbc Stream #0.1: Audio: libvorbis, 48000 Hz, 2 channels, s16 Stream mapping: Stream #2:0 -> #0:0 (rawvideo -> mpeg4) Stream #0:0 -> #0:1 (pcm_s16le -> libvorbis) Press ctrl-c to stop encoding [mpeg4 @ 0x10bd800] rc buffer underflow ^Cframe= 160 fps= 15 q=2.0 Lsize= 3414kB time=10.66 bitrate=2623.0kbits/s video:3273kB audio:131kB global headers:4kB muxing overhead 0.165600% Received signal 2: terminating. I'm not sure if it's the question of mapping (some -map options to add?) or that avconv just can't encode more than 1 video stream at one time. So is it an actual avconv limitation, or a limitation of the available containers, or me simply not finding the right combination of command line options?

    Read the article

  • SQLite, python, unicode, and non-utf data

    - by Nathan Spears
    I started by trying to store strings in sqlite using python, and got the message: sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. Ok, I switched to Unicode strings. Then I started getting the message: sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós' when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur Rós' note: My console was set to display in 'latin_1' as @John Machin pointed out. What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all. I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings? I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update. Summary of answers Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like: str.decode('source_encoding').encode('desired_encoding') or if the str is already in unicode str.encode('desired_encoding') For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python. The encoding of the string you want to work with, and the encoding you want to get it to. The system encoding. The console encoding. The encoding of the source file Elaboration: (1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on. (2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added: \# sitecustomize.py \# this file can be anywhere in your Python path, \# but it usually goes in ${pythondir}/lib/site-packages/ import sys sys.setdefaultencoding('utf_8') This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding. (3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working. (4) If you want to have some test strings, eg test_str = "ó" in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file: # -*- coding: utf_8 -*- If you don't have this information, python attempts to parse your code as ascii by default, and so: SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Now I will change the system and console encoding to latin_1, and I get this output for the same input: 'ó' = original char <type 'str'> repr(char)='\xf3' 'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Notice that the 'original' character displays correctly and the builtin unicode() function works now. Now I change my console output back to utf_8. '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' '?' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Here everything still works the same as last time but the console can't display the output correctly. Etc. The function below also displays more information that this and hopefully would help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this would be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great but sometimes source code can save you a day or two of trying to figure out what functions do what. Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input. One more thing: there are apparently crazy application developers making life difficult in Windows. #!/usr/bin/env python # -*- coding: utf_8 -*- import os import sys def encodingDemo(str): validStrings = () try: print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str)) validStrings += ((str,""),) except UnicodeEncodeError as ude: print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print ude try: x = unicode(str) print "unicode(str) = ",x validStrings+= ((x, " decoded into unicode by the default system encoding"),) except UnicodeDecodeError as ude: print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string." print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()), print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('latin_1') print "str.decode('latin_1') =",x validStrings+= ((x, " decoded with latin_1 into unicode"),) try: print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8') validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),) except UnicodeDecodeError as ude: print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t", print ude except UnicodeDecodeError as ude: print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('utf_8') print "str.decode('utf_8') =",x validStrings+= ((x, " decoded with utf_8 into unicode"),) try: print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1') except UnicodeDecodeError as ude: print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t", validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),) print ude except UnicodeDecodeError as ude: print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee print print "Printing information about each character in the original string." for char in str: try: print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char)) except UnicodeDecodeError as ude: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude) except UnicodeEncodeError as uee: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee) print uee try: x = unicode(char) print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = unicode(char) ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('latin_1') print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('utf_8') print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) print x = 'ó' encodingDemo(x) Much thanks for the answers below and especially to @John Machin for answering so thoroughly.

    Read the article

  • SQLite, python, unicode, and non-utf data

    - by Nathan Spears
    I started by trying to store strings in sqlite using python, and got the message: sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. Ok, I switched to Unicode strings. Then I started getting the message: sqlite3.OperationalError: Could not decode to UTF-8 column 'tag_artist' with text 'Sigur Rós' when trying to retrieve data from the db. More research and I started encoding it in utf8, but then 'Sigur Rós' starts looking like 'Sigur Rós' note: My console was set to display in 'latin_1' as @John Machin pointed out. What gives? After reading this, describing exactly the same situation I'm in, it seems as if the advice is to ignore the other advice and use 8-bit bytestrings after all. I didn't know much about unicode and utf before I started this process. I've learned quite a bit in the last couple hours, but I'm still ignorant of whether there is a way to correctly convert 'ó' from latin-1 to utf-8 and not mangle it. If there isn't, why would sqlite 'highly recommend' I switch my application to unicode strings? I'm going to update this question with a summary and some example code of everything I've learned in the last 24 hours so that someone in my shoes can have an easy(er) guide. If the information I post is wrong or misleading in any way please tell me and I'll update, or one of you senior guys can update. Summary of answers Let me first state the goal as I understand it. The goal in processing various encodings, if you are trying to convert between them, is to understand what your source encoding is, then convert it to unicode using that source encoding, then convert it to your desired encoding. Unicode is a base and encodings are mappings of subsets of that base. utf_8 has room for every character in unicode, but because they aren't in the same place as, for instance, latin_1, a string encoded in utf_8 and sent to a latin_1 console will not look the way you expect. In python the process of getting to unicode and into another encoding looks like: str.decode('source_encoding').encode('desired_encoding') or if the str is already in unicode str.encode('desired_encoding') For sqlite I didn't actually want to encode it again, I wanted to decode it and leave it in unicode format. Here are four things you might need to be aware of as you try to work with unicode and encodings in python. The encoding of the string you want to work with, and the encoding you want to get it to. The system encoding. The console encoding. The encoding of the source file Elaboration: (1) When you read a string from a source, it must have some encoding, like latin_1 or utf_8. In my case, I'm getting strings from filenames, so unfortunately, I could be getting any kind of encoding. Windows XP uses UCS-2 (a Unicode system) as its native string type, which seems like cheating to me. Fortunately for me, the characters in most filenames are not going to be made up of more than one source encoding type, and I think all of mine were either completely latin_1, completely utf_8, or just plain ascii (which is a subset of both of those). So I just read them and decoded them as if they were still in latin_1 or utf_8. It's possible, though, that you could have latin_1 and utf_8 and whatever other characters mixed together in a filename on Windows. Sometimes those characters can show up as boxes, other times they just look mangled, and other times they look correct (accented characters and whatnot). Moving on. (2) Python has a default system encoding that gets set when python starts and can't be changed during runtime. See here for details. Dirty summary ... well here's the file I added: \# sitecustomize.py \# this file can be anywhere in your Python path, \# but it usually goes in ${pythondir}/lib/site-packages/ import sys sys.setdefaultencoding('utf_8') This system encoding is the one that gets used when you use the unicode("str") function without any other encoding parameters. To say that another way, python tries to decode "str" to unicode based on the default system encoding. (3) If you're using IDLE or the command-line python, I think that your console will display according to the default system encoding. I am using pydev with eclipse for some reason, so I had to go into my project settings, edit the launch configuration properties of my test script, go to the Common tab, and change the console from latin-1 to utf-8 so that I could visually confirm what I was doing was working. (4) If you want to have some test strings, eg test_str = "ó" in your source code, then you will have to tell python what kind of encoding you are using in that file. (FYI: when I mistyped an encoding I had to ctrl-Z because my file became unreadable.) This is easily accomplished by putting a line like so at the top of your source code file: # -*- coding: utf_8 -*- If you don't have this information, python attempts to parse your code as ascii by default, and so: SyntaxError: Non-ASCII character '\xf3' in file _redacted_ on line 81, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details Once your program is working correctly, or, if you aren't using python's console or any other console to look at output, then you will probably really only care about #1 on the list. System default and console encoding are not that important unless you need to look at output and/or you are using the builtin unicode() function (without any encoding parameters) instead of the string.decode() function. I wrote a demo function I will paste into the bottom of this gigantic mess that I hope correctly demonstrates the items in my list. Here is some of the output when I run the character 'ó' through the demo function, showing how various methods react to the character as input. My system encoding and console output are both set to utf_8 for this run: '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Now I will change the system and console encoding to latin_1, and I get this output for the same input: 'ó' = original char <type 'str'> repr(char)='\xf3' 'ó' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' 'ó' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Notice that the 'original' character displays correctly and the builtin unicode() function works now. Now I change my console output back to utf_8. '?' = original char <type 'str'> repr(char)='\xf3' '?' = unicode(char) <type 'unicode'> repr(unicode(char))=u'\xf3' '?' = char.decode('latin_1') <type 'unicode'> repr(char.decode('latin_1'))=u'\xf3' '?' = char.decode('utf_8') ERROR: 'utf8' codec can't decode byte 0xf3 in position 0: unexpected end of data Here everything still works the same as last time but the console can't display the output correctly. Etc. The function below also displays more information that this and hopefully would help someone figure out where the gap in their understanding is. I know all this information is in other places and more thoroughly dealt with there, but I hope that this would be a good kickoff point for someone trying to get coding with python and/or sqlite. Ideas are great but sometimes source code can save you a day or two of trying to figure out what functions do what. Disclaimers: I'm no encoding expert, I put this together to help my own understanding. I kept building on it when I should have probably started passing functions as arguments to avoid so much redundant code, so if I can I'll make it more concise. Also, utf_8 and latin_1 are by no means the only encoding schemes, they are just the two I was playing around with because I think they handle everything I need. Add your own encoding schemes to the demo function and test your own input. One more thing: there are apparently crazy application developers making life difficult in Windows. #!/usr/bin/env python # -*- coding: utf_8 -*- import os import sys def encodingDemo(str): validStrings = () try: print "str =",str,"{0} repr(str) = {1}".format(type(str), repr(str)) validStrings += ((str,""),) except UnicodeEncodeError as ude: print "Couldn't print the str itself because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print ude try: x = unicode(str) print "unicode(str) = ",x validStrings+= ((x, " decoded into unicode by the default system encoding"),) except UnicodeDecodeError as ude: print "ERROR. unicode(str) couldn't decode the string because the system encoding is set to an encoding that doesn't understand some character in the string." print "\tThe system encoding is set to {0}. See error:\n\t".format(sys.getdefaultencoding()), print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the unicode(str) because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('latin_1') print "str.decode('latin_1') =",x validStrings+= ((x, " decoded with latin_1 into unicode"),) try: print "str.decode('latin_1').encode('utf_8') =",str.decode('latin_1').encode('utf_8') validStrings+= ((x, " decoded with latin_1 into unicode and encoded into utf_8"),) except UnicodeDecodeError as ude: print "The string was decoded into unicode using the latin_1 encoding, but couldn't be encoded into utf_8. See error:\n\t", print ude except UnicodeDecodeError as ude: print "Something didn't work, probably because the string wasn't latin_1 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('latin_1') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t", print uee try: x = str.decode('utf_8') print "str.decode('utf_8') =",x validStrings+= ((x, " decoded with utf_8 into unicode"),) try: print "str.decode('utf_8').encode('latin_1') =",str.decode('utf_8').encode('latin_1') except UnicodeDecodeError as ude: print "str.decode('utf_8').encode('latin_1') didn't work. The string was decoded into unicode using the utf_8 encoding, but couldn't be encoded into latin_1. See error:\n\t", validStrings+= ((x, " decoded with utf_8 into unicode and encoded into latin_1"),) print ude except UnicodeDecodeError as ude: print "str.decode('utf_8') didn't work, probably because the string wasn't utf_8 encoded. See error:\n\t", print ude except UnicodeEncodeError as uee: print "ERROR. Couldn't print the str.decode('utf_8') because the console is set to an encoding that doesn't understand some character in the string. See error:\n\t",uee print print "Printing information about each character in the original string." for char in str: try: print "\t'" + char + "' = original char {0} repr(char)={1}".format(type(char), repr(char)) except UnicodeDecodeError as ude: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), ude) except UnicodeEncodeError as uee: print "\t'?' = original char {0} repr(char)={1} ERROR PRINTING: {2}".format(type(char), repr(char), uee) print uee try: x = unicode(char) print "\t'" + x + "' = unicode(char) {1} repr(unicode(char))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = unicode(char) ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = unicode(char) {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('latin_1') print "\t'" + x + "' = char.decode('latin_1') {1} repr(char.decode('latin_1'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('latin_1') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('latin_1') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) try: x = char.decode('utf_8') print "\t'" + x + "' = char.decode('utf_8') {1} repr(char.decode('utf_8'))={2}".format(x, type(x), repr(x)) except UnicodeDecodeError as ude: print "\t'?' = char.decode('utf_8') ERROR: {0}".format(ude) except UnicodeEncodeError as uee: print "\t'?' = char.decode('utf_8') {0} repr(char)={1} ERROR PRINTING: {2}".format(type(x), repr(x), uee) print x = 'ó' encodingDemo(x) Much thanks for the answers below and especially to @John Machin for answering so thoroughly.

    Read the article

  • Why does text from Assembly.GetManifestResourceStream() start with three junk characters?

    - by flipdoubt
    I have a SQL file added to my VS.NET 2008 project as an embedded resource. Whenever I use the following code to read the file's content, the string returned always starts with three junk characters and then the text I expect. I assume this has something to do with the Encoding.Default I am using, but that is just a guess. Why does this text keep showing up? Should I just trim off the first three characters or is there a more informed approach? public string GetUpdateRestoreSchemaScript() { var type = GetType(); var a = Assembly.GetAssembly(type); var script = "UpdateRestoreSchema.sql"; var resourceName = String.Concat(type.Namespace, ".", script); using(Stream stream = a.GetManifestResourceStream(resourceName)) { byte[] buffer = new byte[stream.Length]; stream.Read(buffer, 0, buffer.Length); // UPDATE: Should be Encoding.UTF8 return Encoding.Default.GetString(buffer); } } Update: I now know that my code works as expected if I simply change the last line to return a UTF-8 encoded string. It will always be true for this embedded file, but will it always be true? Is there a way to test any buffer to determine its encoding?

    Read the article

  • Perl strings internals

    - by n0rd
    How does perl strings represented internally? What encoding is used? How do I handle different encodings properly? I've been using perl for quite a long time, but it didn't include a lot of string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions. Until this moment I thought about perl strings as sequences of bytes, which did fit pretty well for my tasks. Now I need to do some processing of UTF-8 encoded file and here starts trouble. First, I read file into string like this: open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading"; binmode($in, ':utf8'); my $contents; { local $/; $contents = <$in>; } close($in); then simply print it: print $contents; And I get two things: a warning Wide character in print at <scriptname> line <n> and a garbage in console. So I can conclude that perl strings have a concept of "character" that can be "wide" or not, but when printed these "wide" characters are represented in console as multiple bytes, not as single "character". (I wonder now why did all my previous experience with binary files worked quite how I expected it to work without any "character" issues). Why then I see garbage in console? If perl stores strings as character in some known encoding, I don't think there is a big problem to find out console encoding and print text properly. (I use Windows, BTW). If perl stores strings as multibyte sequences (e.g. using same UTF-8 encoding), why is it done this way? From my C experience handling multibyte strings is PAIN.

    Read the article

< Previous Page | 14 15 16 17 18 19 20 21 22 23 24 25  | Next Page >