This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Provide way to allow for new buffer encodings. #1772

Closed
bmeck opened this issue Sep 26, 2011 · 15 comments

Comments

@bmeck
Member

bmeck commented Sep 26, 2011

Right now buffers use hard-coded values for encodings. When implementing certain protocols (e.g. the base64url logic in http://self-issued.info/docs/draft-jones-json-web-signature.html#base64urllogic), this forces you to write entirely new wrappers around toString, and there is no way to support a custom encoding via the Buffer constructor.

Having something similar to:

Buffer.registerStringEncoding(name, stringify, parse), where stringify(buffer) returns a String and parse(string) returns a Buffer, and which throws on a duplicate name.

The performance issue could be worked around by never encoding to a buffer until toString or a property related to the buffer value is requested (and skipping even that when the encodings match on toString), and then caching the result.
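A minimal sketch of the proposed API (registerStringEncoding is hypothetical, not part of Node core; the names and the base64url example are assumptions for illustration, following the JWS draft linked above):

```javascript
// Hypothetical registry sketch; none of these names exist in Node core.
const customEncodings = new Map();

function registerStringEncoding(name, stringify, parse) {
  // Per the proposal, a duplicate name is an error.
  if (customEncodings.has(name)) {
    throw new Error('encoding already registered: ' + name);
  }
  customEncodings.set(name, { stringify, parse });
}

// Example encoding: base64url as described in the JWS draft
// (standard base64 with URL-safe characters and no padding).
registerStringEncoding(
  'base64url',
  (buf) => buf.toString('base64')
              .replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, ''),
  (str) => Buffer.from(str.replace(/-/g, '+').replace(/_/g, '/'), 'base64')
);

const { stringify, parse } = customEncodings.get('base64url');
console.log(stringify(Buffer.from([251, 255]))); // '-_8'
```

The lazy-caching idea above would sit on top of this: the registered parse function would only run when the buffer's bytes are actually needed.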

@bobrik

bobrik commented Sep 26, 2011

Extending a built-in object isn't a good solution.

@bmeck
Member Author

bmeck commented Sep 26, 2011

bobrik, Buffer is not a JavaScript built-in; it is created by Node's core.

@bobrik

bobrik commented Sep 26, 2011

It's built-in for Node. My own opinion: create a helper object to do that task.

Otherwise, how could I know in any module that an encoding is registered?

@bmeck
Member Author

bmeck commented Sep 26, 2011

This is a feature request to allow extending the "built-in" Buffer object with new encodings, not to change any existing behavior or signatures. Regardless of how it should be done now, I feel that a multitude of wrappers may make encoding and decoding formats difficult, since the wrappers may have many different signatures and encourage tight coupling to particular systems.

To detect whether an encoding is present, one would either need to add the encoding manually to ensure it is registered, or have a means of detecting it when the module needing the encoding is required.

Simply requiring 'encodingType.js' seems simplistic, but doing it at setup time would most likely be fine, installing these items in much the same way that requiring coffeescript acts on '.coffee' files. This is less elegant than it could be, but far more elegant than multiple Buffer wrappers that each implement the same new encoding in potentially many different ways.


@Mithgol

Mithgol commented Nov 28, 2011

@bobrik

how could I know in any module that encoding is registered?

Very good point. Node.js probably needs a boolean .stringEncodingRegistered(encodingName) in addition to .registerStringEncoding(name, decoder, encoder).
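With a Map-backed registry (a sketch; both function names are hypothetical, echoing the proposal above), the detection helper is a one-line predicate:

```javascript
// Hypothetical registry and detection predicate; not real Node APIs.
const customEncodings = new Map();
const stringEncodingRegistered = (name) => customEncodings.has(name);

// A module that needs an encoding can check before relying on it.
customEncodings.set('base64url', { /* stringify/parse would go here */ });
console.log(stringEncodingRegistered('base64url')); // true
console.log(stringEncodingRegistered('cp437'));     // false
```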

@Mithgol

Mithgol commented Mar 26, 2012

However, the node_buffer.cc file does not look as if it was ever written with a freely extensible list of encodings in mind.

There are only a few pre-defined encodings; they appear in many else-if chains, they are hardcoded in constructor names, and so on.

It won't be easy to write a function that adds yet another encoding to that list, because there never was a list as a real data structure to begin with, and you cannot alter a pile of else-if lines programmatically. You'd have to rewrite a good deal of the existing code before you could even start writing such a function.

@Mithgol

Mithgol commented May 11, 2012

I also think that JavaScript functions stringify(buffer) and parse(string) are going to be slow.

For many use cases that is inevitable; however, for simple encodings (such as CP 866, KOI8-R, or Windows-1251) node_buffer.cc could contain a function like PHP's strtr() that accepts a simple array of replace pairs and does most of the work at the C++ (not JavaScript) level, which is faster.
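A JavaScript sketch of the strtr-like idea: the decoder takes an array of [byte, codePoint] replace pairs and substitutes as it walks the buffer. Under the proposal this loop would run in core C++; the single pair below (0xE6 mapping to 'μ') is taken from the CP437 table quoted later in this thread.

```javascript
// Sketch of a replace-pairs decoder; a core version would run this loop in C++.
function decodeWithPairs(buffer, pairs) {
  const map = new Map(pairs); // byte value -> Unicode code point
  let out = '';
  for (const byte of buffer) {
    // Bytes without a replace pair pass through unchanged.
    out += String.fromCharCode(map.has(byte) ? map.get(byte) : byte);
  }
  return out;
}

console.log(decodeWithPairs(Buffer.from([0xE6, 0x21]), [[0xE6, 0x3BC]])); // 'μ!'
```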

The node-iconv module (by @bnoordhuis) already does something similar, but it supports only some predefined encodings, does not work on Windows, and requires compilation (while the Node.js core binaries for Windows are pre-built and available for download).

@bnoordhuis
Member

The module node-iconv (by @bnoordhuis) already does something similar, but supports only some predefined encodings, and is non-Windows, and requires compilation (while Node.js core binaries for Windows are pre-built and available for download).

node-iconv should now compile on Windows: bnoordhuis/node-iconv#27

As to 'only some predefined encodings', we're talking about 120 supported encodings. :-)

@bnoordhuis
Member

Closing; no activity in over a year, and node-iconv and iconv-lite have the market pretty much cornered.

We may someday allow pluggable encodings but that's ultra-low priority.

@Mithgol

Mithgol commented Oct 12, 2013

Iconv-lite seems to have a .decode(buffer, encoding) signature, while Node.js Buffer's .toString also takes optional start and end parameters.

I worked around it by creating my own node-singlebyte module.

Of course, I still look forward to the day when pluggable encodings become natively supported (because native code would likely run faster than JavaScript), but for now I can live without it.
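The mismatch between the two signatures above can be bridged by slicing first, which is a zero-copy operation on Buffers. A sketch, where `decode` stands in for any decoder of the decode(buffer, encoding) shape (the trivial ascii decoder below is an assumption for illustration, not iconv-lite itself):

```javascript
// Recover toString-style (encoding, start, end) semantics from a decoder
// that only accepts (buffer, encoding). Buffer#slice returns a view, not
// a copy, so this adds no per-byte overhead.
function decodeRange(decode, buffer, encoding, start, end) {
  return decode(buffer.slice(start, end), encoding);
}

// Stand-in decoder for the demo; any decode(buffer, encoding) works here.
const asciiDecode = (buf, enc) => buf.toString(enc);
console.log(decodeRange(asciiDecode, Buffer.from('uTorrents'), 'ascii', 0, 8)); // 'uTorrent'
```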

@trevnorris

I'm not sure what you mean by "pluggable encodings" or by "single byte". V8's internal API is called OneByte, which is essentially latin1. V8 also natively supports TwoByte and UTF8.

@bnoordhuis
Member

I'm not sure what you mean by "pluggable encodings"

Pluggable decoders for StringDecoder.

@trevnorris

Hmm. I don't really get why there'd be a request for this in core. We
support all the encodings we require, and as shown others are easily
handled in a module. If done properly there won't be any performance
difference.

But, I do forget people want core to be more than just a "core".

@Mithgol

Mithgol commented Oct 17, 2013

The idea is not about adding more encoding definitions to the core.

The idea is about adding an API that would allow userland modules to define more encodings for their own purposes, in a way that would let the core run new Buffer(string, encoding) and buf.toString(encoding, start, end) for those encodings, reusing most of the current core Buffer code.

The typical userland code would then look like the following:

Buffer.defineEncoding('cp437', [
 0x00,   0x01,   0x02,   0x03,   0x04,   0x05,   0x06,   0x07,
 0x08,   0x09,   0x0A,   0x0B,   0x0C,   0x0D,   0x0E,   0x0F,
 0x10,   0x11,   0x12,   0x13,   0x14,   0x15,   0x16,   0x17,
 0x18,   0x19,   0x1A,   0x1B,   0x1C,   0x1D,   0x1E,   0x1F,
 0x20,   0x21,   0x22,   0x23,   0x24,   0x25,   0x26,   0x27,
 0x28,   0x29,   0x2A,   0x2B,   0x2C,   0x2D,   0x2E,   0x2F,
 0x30,   0x31,   0x32,   0x33,   0x34,   0x35,   0x36,   0x37,
 0x38,   0x39,   0x3A,   0x3B,   0x3C,   0x3D,   0x3E,   0x3F,
 0x40,   0x41,   0x42,   0x43,   0x44,   0x45,   0x46,   0x47,
 0x48,   0x49,   0x4A,   0x4B,   0x4C,   0x4D,   0x4E,   0x4F,
 0x50,   0x51,   0x52,   0x53,   0x54,   0x55,   0x56,   0x57,
 0x58,   0x59,   0x5A,   0x5B,   0x5C,   0x5D,   0x5E,   0x5F,
 0x60,   0x61,   0x62,   0x63,   0x64,   0x65,   0x66,   0x67,
 0x68,   0x69,   0x6A,   0x6B,   0x6C,   0x6D,   0x6E,   0x6F,
 0x70,   0x71,   0x72,   0x73,   0x74,   0x75,   0x76,   0x77,
 0x78,   0x79,   0x7A,   0x7B,   0x7C,   0x7D,   0x7E,   0x7F,
 0xC7,   0xFC,   0xE9,   0xE2,   0xE4,   0xE0,   0xE5,   0xE7,
 0xEA,   0xEB,   0xE8,   0xEF,   0xEE,   0xEC,   0xC4,   0xC5,
 0xC9,   0xE6,   0xC6,   0xF4,   0xF6,   0xF2,   0xFB,   0xF9,
 0xFF,   0xD6,   0xDC,   0xA2,   0xA3,   0xA5,  0x20A7, 0x192,
 0xE1,   0xED,   0xF3,   0xFA,   0xF1,   0xD1,   0xAA,   0xBA,
 0xBF,  0x2310,  0xAC,   0xBD,   0xBC,   0xA1,   0xAB,   0xBB,
0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
0x3B1,  0x3B2,  0x393,  0x3C0,  0x3A3,  0x3C3,  0x3BC,  0x3C4,
0x3A6,  0x398,  0x3A9,  0x3B4,  0x221E, 0x3C6,  0x3B5,  0x2229,
0x2261,  0xB1,  0x2265, 0x2264, 0x2320, 0x2321,  0xF7,  0x2248,
 0xB0,  0x2219,  0xB7,  0x221A, 0x207F,  0xB2,  0x25A0,  0xA0
]);

var buf = new Buffer([0xE6, 0x54, 0x6F, 0x72, 0x72, 0x65, 0x6E, 0x74, 0x73]);
console.log( buf.toString('cp437', 0, 8) ); // 'μTorrent'

If done properly there won't be any performance difference.

If done in JavaScript, it would run slower than node_buffer.cc code (C/C++ code runs ≈5x faster than JavaScript).

If done as a C/C++ addon, it would face the general unavailability of build tools on end-user Windows systems.

There's no good way to solve this problem; there's only a (not obvious) choice of the lesser of three evils:

  • If you add this feature to the core, the core will be bloated.
  • If you make a JavaScript userland module, it will lose 80% of the possible speed.
  • If you make a C/C++ addon in a userland module, then a simple npm install is no longer possible on a typical Windows machine; it has to meet node-gyp's requirements beforehand (and, for example, the Windows 7 64-bit SDK .ISO file is 570.9 MB).

The proposed solution tries to "split the baby": put the encoder and decoder in the core, but let the encoding definitions come from userland. The userland code can then remain in JavaScript (definitions are plain data structures, such as the array in the example above, so the 80% speed drop doesn't affect them), and the addition to the core is minimal.
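The split can be sketched entirely in userland JavaScript (Buffer.defineEncoding is hypothetical, not a real API; under the proposal the per-byte loop below would live in core C++, and only the 256-entry table would come from userland as data). The tiny table here passes ASCII through and maps 0xE6 to 'μ' as in the CP437 table above, with U+FFFD for everything else:

```javascript
// Userland sketch of the proposed core/userland split; all names hypothetical.
const encodingTables = new Map();

function defineEncoding(name, table) {
  if (table.length !== 256) throw new Error('table must map all 256 byte values');
  if (encodingTables.has(name)) throw new Error('duplicate encoding: ' + name);
  encodingTables.set(name, Uint16Array.from(table)); // code points fit in 16 bits
}

// This loop is what the proposal would move into core C++.
function decodeWith(buf, name, start = 0, end = buf.length) {
  const table = encodingTables.get(name);
  let out = '';
  for (let i = start; i < end; i++) out += String.fromCharCode(table[buf[i]]);
  return out;
}

// Illustrative fragment, not the full CP437 table from the comment above.
const table = Array.from({ length: 256 },
  (_, i) => (i < 0x80 ? i : i === 0xE6 ? 0x3BC : 0xFFFD));
defineEncoding('cp437-fragment', table);

const buf = Buffer.from([0xE6, 0x54, 0x6F, 0x72, 0x72, 0x65, 0x6E, 0x74, 0x73]);
console.log(decodeWith(buf, 'cp437-fragment', 0, 8)); // 'μTorrent'
```

Throwing on a duplicate name, as in the original proposal, also gives a defined failure mode when two modules try to register the same encoding.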

@rlidwka

rlidwka commented Oct 20, 2013

That sounds like an extremely bad idea, especially if two or more modules decide to implement different versions of the same encoding under the same name.
