Linux.com


Linux tools to convert file formats

By Federico Kereki on July 22, 2008 (4:00:00 PM)


Life would be a lot easier if we could live in a Linux-only world and if applications never required data from other sources. However, the need to get data from Windows, MS-DOS, or old Macintosh systems is all too common. This kind of import process requires some conversions to solve file format differences; otherwise, it would be impossible to share data, or file contents would be imported incorrectly. The easiest way to transfer data between systems is by using plain text files or common formats like comma-separated value (CSV) files. However, even plain text files that come from Windows or Mac OS differ from Linux files in their newline characters and character encoding. This article explains why we have these problems and shows ways to solve them.

The newline problem

Every operating system uses a special character (or sequence of characters) to signify the end of a line of text. They cannot use standard, common characters to represent the line end, because those could appear in normal text, so they use special, nonprinting characters -- but each operating system uses different ones:

  • Linux and Mac OS X inherited the Unix style of using LF (line feed, an ASCII control character) at the end of each line.
  • Older Macintosh systems use CR (carriage return, another ASCII control character).
  • Windows uses a pair of characters -- both a CR and an LF.

To check what a particular text file uses to indicate a new line, try hexdump, which lets you inspect the contents of a file at the byte level. I prepared two three-line files -- one on a Linux system and one on a Windows machine -- and dumped their contents. The file gets dumped 16 bytes at a time: the -c option shows the actual characters, the -b option shows their octal values, and the offset at the start of each line is given in hexadecimal. Notice that the Linux file has a \n character at the end of each line, while the Windows version uses \r and \n. An older Macintosh file would have used a single \r character instead.

> cat test.linux
This is the first line of a Linux file.
This is the second line.
Here's the last line.

> hexdump -cb test.linux
0000000   T   h   i   s       i   s       t   h   e       f   i   r   s
0000000 124 150 151 163 040 151 163 040 164 150 145 040 146 151 162 163
0000010   t       l   i   n   e       o   f       a       L   i   n   u
0000010 164 040 154 151 156 145 040 157 146 040 141 040 114 151 156 165
0000020   x       f   i   l   e   .  \n   T   h   i   s       i   s
0000020 170 040 146 151 154 145 056 012 124 150 151 163 040 151 163 040
0000030   t   h   e       s   e   c   o   n   d       l   i   n   e   .
0000030 164 150 145 040 163 145 143 157 156 144 040 154 151 156 145 056
0000040  \n   H   e   r   e   '   s       t   h   e       l   a   s   t
0000040 012 110 145 162 145 047 163 040 164 150 145 040 154 141 163 164
0000050       l   i   n   e   .  \n
0000050 040 154 151 156 145 056 012
0000057

> cat test.windows
This is the first line on a Windows file.
This is the second line.
Here's the last line.

> hexdump -cb test.windows
0000000   T   h   i   s       i   s       t   h   e       f   i   r   s
0000000 124 150 151 163 040 151 163 040 164 150 145 040 146 151 162 163
0000010   t       l   i   n   e       o   n       a       W   i   n   d
0000010 164 040 154 151 156 145 040 157 156 040 141 040 127 151 156 144
0000020   o   w   s       f   i   l   e   .  \r  \n   T   h   i   s
0000020 157 167 163 040 146 151 154 145 056 015 012 124 150 151 163 040
0000030   i   s       t   h   e       s   e   c   o   n   d       l   i
0000030 151 163 040 164 150 145 040 163 145 143 157 156 144 040 154 151
0000040   n   e   .  \r  \n   H   e   r   e   '   s       t   h   e
0000040 156 145 056 015 012 110 145 162 145 047 163 040 164 150 145 040
0000050   l   a   s   t       l   i   n   e   .  \r  \n
0000050 154 141 163 164 040 154 151 156 145 056 015 012
000005c
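
If you need to check many files, reading dumps by eye gets tedious. The following is a minimal sketch of how the check could be scripted; it is not part of the original listings. The function name detect_eol and the sample file names are made up for illustration, and the test is only a heuristic: it counts CR and LF bytes and assumes that equal counts mean CR+LF pairs.

```shell
#!/bin/sh
# detect_eol FILE: guess the newline convention by counting CR and LF bytes.
# Heuristic only: equal, nonzero counts are assumed to mean CR+LF pairs.
detect_eol() {
    cr=$(tr -cd '\r' < "$1" | wc -c)   # number of CR (octal 015) bytes
    lf=$(tr -cd '\n' < "$1" | wc -c)   # number of LF (octal 012) bytes
    cr=$((cr)); lf=$((lf))             # normalize any padding added by wc
    if [ "$cr" -eq 0 ] && [ "$lf" -eq 0 ]; then
        echo "no line endings"
    elif [ "$cr" -eq 0 ]; then
        echo "LF (Linux, Mac OS X)"
    elif [ "$lf" -eq 0 ]; then
        echo "CR (older Macintosh)"
    elif [ "$cr" -eq "$lf" ]; then
        echo "CR+LF (Windows)"
    else
        echo "mixed"
    fi
}

# Build three small sample files (hypothetical names) and classify them.
dir=$(mktemp -d)
printf 'one\ntwo\n'     > "$dir/sample.linux"
printf 'one\r\ntwo\r\n' > "$dir/sample.windows"
printf 'one\rtwo\r'     > "$dir/sample.oldmac"
for f in "$dir"/sample.*; do
    printf '%s: %s\n' "${f##*/}" "$(detect_eol "$f")"
done
rm -rf "$dir"
```

A file reported as "mixed" is worth inspecting with hexdump before converting it, since a blanket conversion may not do what you want.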

To convert files from Windows to Linux, you can use the appropriately titled dos2unix command. The simplest way to convert test.windows to the Linux format would be with dos2unix test.windows, but you can also use the command in stream fashion -- for example, dos2unix <test.windows >test.windows.fixed. Check all possible options with dos2unix -h or man dos2unix.
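
For several files at once, a small wrapper can help. This is only a sketch under stated assumptions: the function name to_unix and the sample files are hypothetical, the -q (quiet) option is assumed to be available in your dos2unix, and the fallback used when dos2unix is not installed deletes every CR byte with tr, which is slightly more aggressive than dos2unix itself.

```shell
#!/bin/sh
# to_unix FILE...: convert each file from CR+LF to LF line endings in place.
to_unix() {
    for f in "$@"; do
        if command -v dos2unix >/dev/null 2>&1; then
            dos2unix -q "$f"               # -q suppresses progress messages
        else
            # Fallback when dos2unix is missing: delete every CR byte.
            tmp=$(mktemp) || return 1
            tr -d '\r' < "$f" > "$tmp" && cat "$tmp" > "$f"
            rm -f "$tmp"
        fi
    done
}

# Try it on two throwaway files with Windows line endings.
dir=$(mktemp -d)
printf 'first\r\nsecond\r\n' > "$dir/a.txt"
printf 'third\r\n'           > "$dir/b.txt"
to_unix "$dir"/*.txt
od -c "$dir/a.txt"     # the dump should no longer show any \r
rm -rf "$dir"
```

Because the conversion happens in place, keep a copy of anything you cannot regenerate.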

An old-fashioned Macintosh text file requires changing CR characters to LF ones, so you could use the tr (translate) command with tr "\015" "\012" <anOldMacintoshFile >theNewLinuxFile, which simply changes each CR (octal 015) into LF (octal 012). With tr, you could also use the -d (delete) option to remove CR characters from a Windows file, thus giving you a valid Linux file. The last conversion shown in the previous paragraph could also be done with tr -d "\015" <test.windows >test.windows.fixed, and the results would be identical.
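
To convince yourself that both tr invocations produce a proper Linux file, you can compare their output with a reference file. The sketch below wraps the two commands in filter functions; the names mac_to_unix and win_to_unix, and the sample file names, are invented for this illustration.

```shell
#!/bin/sh
# Filters that read stdin and write stdout.
mac_to_unix() { tr '\015' '\012'; }   # old Macintosh: turn each CR into an LF
win_to_unix() { tr -d '\015'; }       # Windows: delete each CR, keep the LF

dir=$(mktemp -d)
printf 'line one\nline two\n'     > "$dir/expected.linux"
printf 'line one\rline two\r'     > "$dir/sample.oldmac"
printf 'line one\r\nline two\r\n' > "$dir/sample.windows"

mac_to_unix < "$dir/sample.oldmac"  > "$dir/oldmac.fixed"
win_to_unix < "$dir/sample.windows" > "$dir/windows.fixed"

# cmp is silent and returns 0 when two files are byte-for-byte identical.
cmp "$dir/oldmac.fixed"  "$dir/expected.linux" && echo "oldmac: converted"
cmp "$dir/windows.fixed" "$dir/expected.linux" && echo "windows: converted"
rm -rf "$dir"
```

Pick the filter that matches the source: translating CR into LF on a Windows file would leave two LF characters per line, that is, a blank line after every line of text.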

The encoding problem

English and other languages include some special typographical characters in addition to the normal 26 letters. Have you ever watched the movie Æon Flux or sent in a curriculum vitæ? If you have to import German text, you'll find lots of vowels with an umlaut on them, and some ß characters as well. Spanish adds ñ and acute accents to the mix, and French has grave and circumflex accents ("pie à la mode," anybody?). (If you need to type these characters on a standard English keyboard, check out our article on how to customize your keyboard.)

You don't need Unicode for standard English: you can do perfectly well with ASCII, which includes the plain unaccented letters from A to Z, the digits from 0 to 9, and some punctuation characters. If you deal with only the Latin alphabets used in Western European languages, you might also get files encoded in ISO 8859-1 (informally known as Latin-1), which lets you use just a single byte per character at the cost of not being able to represent other scripts. However, some languages need a wider character set and require Unicode, which supports more than 100,000 characters in dozens of scripts. Unicode (whose character set is shared with the ISO 10646 standard) is a superset of ASCII, but while every ASCII character fits in one byte, Unicode has far too many characters for that, so most of them need more than one byte when stored. To achieve compatibility between Unicode and ASCII, the UTF-8 character encoding is generally used. UTF-8 uses just a single byte for ASCII characters (so any ASCII file is automatically a valid UTF-8 file) and uses two to four bytes for accented letters and all other characters.

While UTF-8 and Latin-1 do use the same representation for ASCII characters, they differ for other characters; therefore, in order to process the file correctly, you must know its format, or all non-ASCII characters might end up garbled.
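
When you don't know which encoding a file uses, one rough test is to ask whether it decodes cleanly as UTF-8. The sketch below assumes the iconv utility is installed (it is not one of the tools this article covers); the function name is_utf8 and the sample files are hypothetical.

```shell
#!/bin/sh
# is_utf8 FILE: succeed if every byte sequence in FILE is valid UTF-8.
# Assumption: iconv is available; it exits nonzero on invalid input.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

dir=$(mktemp -d)
printf 'ma\303\261ana\n' > "$dir/word.utf8"     # "mañana" encoded as UTF-8
printf 'ma\361ana\n'     > "$dir/word.latin1"   # "mañana" encoded as Latin-1
for f in "$dir"/word.*; do
    if is_utf8 "$f"; then
        echo "${f##*/}: valid UTF-8"
    else
        echo "${f##*/}: not valid UTF-8 (possibly Latin-1)"
    fi
done
rm -rf "$dir"
```

The test works in one direction only. A pure ASCII file passes, because ASCII is valid UTF-8, and a failure does not prove the file is Latin-1, since any sequence of bytes can be read as Latin-1 text.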

Check out the following files, with the same Spanish text ("¡Que la Fuerza te acompañe!", or "May the Force be with you!") in both formats:

> hexdump -cb test.force.utf8
0000000 302 241   Q   u   e       l   a       F   u   e   r   z   a
0000000 302 241 121 165 145 040 154 141 040 106 165 145 162 172 141 040
0000010   t   e       a   c   o   m   p   a 303 261   e   !  \n
0000010 164 145 040 141 143 157 155 160 141 303 261 145 041 012
000001e

> hexdump -cb test.force.latin1
0000000 241   Q   u   e       l   a       F   u   e   r   z   a       t
0000000 241 121 165 145 040 154 141 040 106 165 145 162 172 141 040 164
0000010   e       a   c   o   m   p   a 361   e   !  \n
0000010 145 040 141 143 157 155 160 141 361 145 041 012
000001c

If you check the foreign characters (the inverted exclamation sign at the beginning and the ñ near the end), you can verify that UTF-8 uses two bytes for each, while Latin-1 requires just one. Also, note that "normal" ASCII characters are the same in both formats.
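
The byte counts are easy to confirm without a dump. In this small sketch, the helper name byte_count is made up, and the characters are written as octal escapes so that the script does not depend on your terminal's encoding. The euro sign does not appear in the sample files; it is included only to show a character that needs three bytes.

```shell
#!/bin/sh
# byte_count ESCAPES: print how many bytes a printf escape string produces.
byte_count() {
    n=$(printf "$1" | wc -c)
    echo $((n))                 # normalize any padding added by wc
}

byte_count 'A'                  # ASCII letter, same in both encodings: 1
byte_count '\361'               # n with tilde in Latin-1: 1
byte_count '\303\261'           # n with tilde in UTF-8: 2
byte_count '\342\202\254'       # euro sign in UTF-8: 3
```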

Determining what format and encoding you require depends on your particular application. While most Linux programs work with UTF-8, many others use Latin-1, and some are even able to use both. You first need to learn what your application expects, and then convert the text file to that format, if necessary. Fortunately, you can easily accomplish the required translation in both directions (either from or to UTF-8) by using the recode command.

recode offers many options (use recode --help or info recode for a more thorough description), and you can use the program to convert to and from many different formats (try recode -l to get the list of supported formats). However, for simple conversions, it's enough to do recode UTF-8..ISO-8859-1 test.force.utf8 to get a Latin-1 version, or recode ISO-8859-1..UTF-8 test.force.latin1 to go the other way. Be aware that when you give recode a file name, it overwrites that file with the converted text; to keep the original, use the command as a filter instead, as in recode UTF-8..ISO-8859-1 <test.force.utf8 >test.force.latin1. Depending on the specific conversion you need, the newline problem might be taken care of automatically, but check the documentation (or give it a try and see what happens) for each specific case.
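
The two conversions can be wrapped as filters so that the original file is never touched. This is a sketch, not a definitive recipe: the function names are hypothetical, and the fallback to iconv when recode is not installed is an assumption about your system, since iconv is not covered in this article.

```shell
#!/bin/sh
# Filters: read one encoding on stdin, write the other on stdout.
utf8_to_latin1() {
    if command -v recode >/dev/null 2>&1; then
        recode UTF-8..ISO-8859-1
    else
        iconv -f UTF-8 -t ISO-8859-1     # fallback, assumes iconv exists
    fi
}
latin1_to_utf8() {
    if command -v recode >/dev/null 2>&1; then
        recode ISO-8859-1..UTF-8
    else
        iconv -f ISO-8859-1 -t UTF-8     # fallback, assumes iconv exists
    fi
}

# Convert the sample sentence in both directions and dump the result.
printf '\302\241Que la Fuerza te acompa\303\261e!\n' | utf8_to_latin1 | od -c
printf '\241Que la Fuerza te acompa\361e!\n'         | latin1_to_utf8 | od -c
```

Converting to Latin-1 can fail for text that contains characters Latin-1 cannot represent, so check the exit status before relying on the output.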

In conclusion

Processing text files from other operating systems is not a straightforward process, but Linux provides tools to make the job easy. No matter what format a file is in, you can automate the required conversion steps and deal with the inconvenience of incompatible formats.

Federico Kereki is an Uruguayan systems engineer with more than 20 years' experience developing systems, doing consulting work, and teaching at universities.


Comments

on Linux tools to convert file formats


Linux tools to convert file formats

Posted by: Anonymous [ip: 206.220.8.10] on July 22, 2008 05:06 PM
"Spanish adds ň and acute..."

I think the character is wrong there. It should be "ñ".

#

Linux tools to convert file formats

Posted by: Anonymous [ip: 146.114.69.89] on July 22, 2008 05:13 PM
For the newline problem I have always used this strange method based on the fact that unzipping converts the characters appropriately:
zip tmp.zip <the files I want to convert>; unzip -ao tmp.zip
and wallah, the files have been converted. Strange, I know, but it works. I always say I'm going to wrap it in a script but never get around to it.

#

Re: Linux tools to convert file formats

Posted by: Anonymous [ip: 80.176.154.65] on July 23, 2008 08:12 AM
wallah?
What kind of conversion did that word go through?

It's voila!

#

Re(1): Linux tools to convert file formats

Posted by: Anonymous [ip: 75.145.41.49] on July 23, 2008 02:30 PM
No it isn't. It's voilà!

#

Re: Linux tools to convert file formats

Posted by: Anonymous [ip: 204.92.92.4] on July 23, 2008 03:23 PM
This can also be done in vi,

:g/^V^M$/s///

^V ^M - press and hold the Ctrl key and type the letter

#

Linux tools to convert file formats

Posted by: Anonymous [ip: 82.241.234.41] on July 22, 2008 06:14 PM
Don't use ISO Latin-1 anymore; use ISO Latin-9 (aka ISO 8859-15) instead, which includes €, the currency symbol for the euro, currently the second most used currency worldwide.

#

Wrong

Posted by: Anonymous [ip: 82.192.250.149] on July 22, 2008 09:54 PM
"Life would be a lot easier if we could live in a Linux-only world"

No it wouldn't. We need diversity to open our minds to different ways of doing things. And there are some things (systems with more than 64 cpus, for example) for which Unix is still better than Linux.

What we need is clear, open, documented file formats. We have that for the simple ascii files which this article covers, but not for many other kinds of file.

#

File Formats?

Posted by: Anonymous [ip: 82.95.236.142] on July 23, 2008 08:06 AM
Are we actually talking about file formats here?
Character encoding and file format are not synonyms.

bjd

#

Linux tools to convert file formats

Posted by: Anonymous [ip: 89.235.36.227] on July 23, 2008 10:35 AM
Life would be a lot easier if we could live in a Windows-only world and if applications never required data from other sources.

#

Re: Linux tools to convert file formats

Posted by: Anonymous [ip: 70.139.75.75] on July 26, 2008 03:47 PM
AAhhhh, where do I start with this comment?. How about, internet life would be easier
without trolls?. How about this one? Go live in a Windows only world, millions of other
people do and don't care.

#

Linux tools to convert file formats

Posted by: Anonymous [ip: 69.69.28.85] on July 23, 2008 12:23 PM
Dude, all I know is LInux totally ROCKS! Period. Best O/S of all times. Wish I would have made the switch years ago.

JT
www.FireMe.To/udi

#

Re: Linux tools to convert file formats

Posted by: Anonymous [ip: 68.229.155.86] on July 24, 2008 01:40 AM
A juvenile response at best...

#




 