This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Provide way to allow for new buffer encodings. #1772

Closed
bmeck opened this issue Sep 26, 2011 · 15 comments

Comments

@bmeck
Member

bmeck commented Sep 26, 2011

Right now buffers use hard-coded values for encodings. When implementing certain protocols (e.g. the base64url logic in http://self-issued.info/docs/draft-jones-json-web-signature.html#base64urllogic), this forces you to write entirely new wrappers around toString, and there is no way to support a custom encoding via the Buffer constructor.

Having something similar to:

Buffer.registerStringEncoding(name, stringify, parse), where stringify(buffer) returns a String and parse(string) returns a Buffer, and which throws on a duplicate name.

The performance issue could be worked around by never encoding to a buffer until toString or a property related to the buffer value is requested (and skipping even that when the encodings match on toString), and then caching the result.
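A minimal sketch of the proposed API (registerStringEncoding is hypothetical, not part of Node core; the names and the base64url example are assumptions for illustration, following the JWS draft linked above):

```javascript
// Hypothetical registry sketch; none of these names exist in Node core.
const customEncodings = new Map();

function registerStringEncoding(name, stringify, parse) {
  // Per the proposal, a duplicate name is an error.
  if (customEncodings.has(name)) {
    throw new Error('encoding already registered: ' + name);
  }
  customEncodings.set(name, { stringify, parse });
}

// Example encoding: base64url as described in the JWS draft
// (standard base64 with URL-safe characters and no padding).
registerStringEncoding(
  'base64url',
  (buf) => buf.toString('base64')
              .replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, ''),
  (str) => Buffer.from(str.replace(/-/g, '+').replace(/_/g, '/'), 'base64')
);

const { stringify, parse } = customEncodings.get('base64url');
console.log(stringify(Buffer.from([251, 255]))); // '-_8'
```

The lazy-caching idea above would sit on top of this: the registered parse function would only run when the buffer's bytes are actually needed.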

@bobrik

bobrik commented Sep 26, 2011

Extending a built-in object isn't a good solution.

@bmeck
Member Author

bmeck commented Sep 26, 2011

bobrik, Buffer is not a JavaScript built-in; it is created by Node's core.

@bobrik

bobrik commented Sep 26, 2011

It's built-in for Node. My own opinion: create a helper object to do that task.

Otherwise, how could I know in any module that an encoding is registered?

@bmeck
Member Author

bmeck commented Sep 26, 2011

This is a feature request to allow extending the "built-in" Buffer object with new encodings, not to change any existing behavior or signatures. Regardless of how it should be done now, I feel that a multitude of wrappers may make encoding and decoding formats difficult, since the wrappers may have many different signatures and encourage tight coupling to particular systems.

To detect whether an encoding is present, one would either need to add the encoding manually to ensure it is registered, or have a means of detecting it when the module needing the encoding is required.

Simply requiring 'encodingType.js' seems simplistic, but doing it at setup time would most likely be fine, installing these items in much the same way that requiring coffeescript acts on '.coffee' files. This is less elegant than it could be, but far more elegant than multiple Buffer wrappers that each implement the same new encoding in potentially many different ways.


@Mithgol

Mithgol commented Nov 28, 2011

@bobrik

how could I know in any module that encoding is registered?

Very good point. Node.js probably needs a boolean .stringEncodingRegistered(encodingName) in addition to .registerStringEncoding(name, decoder, encoder).
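With a Map-backed registry (a sketch; both function names are hypothetical, echoing the proposal above), the detection helper is a one-line predicate:

```javascript
// Hypothetical registry and detection predicate; not real Node APIs.
const customEncodings = new Map();
const stringEncodingRegistered = (name) => customEncodings.has(name);

// A module that needs an encoding can check before relying on it.
customEncodings.set('base64url', { /* stringify/parse would go here */ });
console.log(stringEncodingRegistered('base64url')); // true
console.log(stringEncodingRegistered('cp437'));     // false
```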

@Mithgol

Mithgol commented Mar 26, 2012

However, the node_buffer.cc file does not look as if it was ever written with a freely extensible list of encodings in mind.

There are only a few pre-defined encodings; they appear in many else-if chains, they are hardcoded in constructor names, and so on.

It won't be easy to write a function that adds yet another encoding to that list, because there never was a list as a real data structure to begin with, and you cannot alter a pile of else-if lines programmatically. You'd have to rewrite a good deal of the existing code before you could even start writing such a function.

@Mithgol

Mithgol commented May 11, 2012

I also think that JavaScript functions stringify(buffer) and parse(string) are going to be slow.

For many use cases that is inevitable; however, for simple encodings (such as CP 866, KOI8-R, or Windows-1251) node_buffer.cc could contain a function like PHP's strtr() that accepts a simple array of replace pairs and does most of the work at the C++ (not JavaScript) level, which is faster.
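A JavaScript sketch of the strtr-like idea: the decoder takes an array of [byte, codePoint] replace pairs and substitutes as it walks the buffer. Under the proposal this loop would run in core C++; the single pair below (0xE6 mapping to 'μ') is taken from the CP437 table quoted later in this thread.

```javascript
// Sketch of a replace-pairs decoder; a core version would run this loop in C++.
function decodeWithPairs(buffer, pairs) {
  const map = new Map(pairs); // byte value -> Unicode code point
  let out = '';
  for (const byte of buffer) {
    // Bytes without a replace pair pass through unchanged.
    out += String.fromCharCode(map.has(byte) ? map.get(byte) : byte);
  }
  return out;
}

console.log(decodeWithPairs(Buffer.from([0xE6, 0x21]), [[0xE6, 0x3BC]])); // 'μ!'
```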

The node-iconv module (by @bnoordhuis) already does something similar, but it supports only some predefined encodings, does not work on Windows, and requires compilation (while the Node.js core binaries for Windows are pre-built and available for download).

@bnoordhuis
Member

The module node-iconv (by @bnoordhuis) already does something similar, but supports only some predefined encodings, and is non-Windows, and requires compilation (while Node.js core binaries for Windows are pre-built and available for download).

node-iconv should now compile on Windows: bnoordhuis/node-iconv#27

As to 'only some predefined encodings', we're talking about 120 supported encodings. :-)

@bnoordhuis
Member

Closing; no activity in over a year, and node-iconv and iconv-lite have the market pretty much cornered.

We may someday allow pluggable encodings but that's ultra-low priority.

@Mithgol

Mithgol commented Oct 12, 2013

Iconv-lite seems to have a .decode(buffer, encoding) signature, while Node.js Buffer's .toString also takes optional start and end parameters.

I worked around it by creating my own node-singlebyte module.

Of course, I still look forward to the day when pluggable encodings become natively supported (because native code would likely run faster than JavaScript), but for now I can live without it.
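The mismatch between the two signatures above can be bridged by slicing first, which is a zero-copy operation on Buffers. A sketch, where `decode` stands in for any decoder of the decode(buffer, encoding) shape (the trivial ascii decoder below is an assumption for illustration, not iconv-lite itself):

```javascript
// Recover toString-style (encoding, start, end) semantics from a decoder
// that only accepts (buffer, encoding). Buffer#slice returns a view, not
// a copy, so this adds no per-byte overhead.
function decodeRange(decode, buffer, encoding, start, end) {
  return decode(buffer.slice(start, end), encoding);
}

// Stand-in decoder for the demo; any decode(buffer, encoding) works here.
const asciiDecode = (buf, enc) => buf.toString(enc);
console.log(decodeRange(asciiDecode, Buffer.from('uTorrents'), 'ascii', 0, 8)); // 'uTorrent'
```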

@trevnorris

I'm not sure what you mean by "pluggable encodings" or by "single byte". V8's internal API is called OneByte, which is essentially latin1. V8 also natively supports TwoByte and UTF8.

@bnoordhuis
Member

I'm not sure what you mean by "pluggable encodings"

Pluggable decoders for StringDecoder.

@trevnorris

Hmm. I don't really get why there'd be a request for this in core. We
support all the encodings we require, and as shown others are easily
handled in a module. If done properly there won't be any performance
difference.

But, I do forget people want core to be more than just a "core".

@Mithgol

Mithgol commented Oct 17, 2013

The idea is not about adding more encoding definitions to the core.

The idea is about adding an API that would allow userland modules to define more encodings for their own purposes, in a way that would let the core run new Buffer(string, encoding) and buf.toString(encoding, start, end) for those encodings, reusing most of the current core Buffer code.

The typical userland code would then look like the following:

Buffer.defineEncoding('cp437', [
 0x00,   0x01,   0x02,   0x03,   0x04,   0x05,   0x06,   0x07,
 0x08,   0x09,   0x0A,   0x0B,   0x0C,   0x0D,   0x0E,   0x0F,
 0x10,   0x11,   0x12,   0x13,   0x14,   0x15,   0x16,   0x17,
 0x18,   0x19,   0x1A,   0x1B,   0x1C,   0x1D,   0x1E,   0x1F,
 0x20,   0x21,   0x22,   0x23,   0x24,   0x25,   0x26,   0x27,
 0x28,   0x29,   0x2A,   0x2B,   0x2C,   0x2D,   0x2E,   0x2F,
 0x30,   0x31,   0x32,   0x33,   0x34,   0x35,   0x36,   0x37,
 0x38,   0x39,   0x3A,   0x3B,   0x3C,   0x3D,   0x3E,   0x3F,
 0x40,   0x41,   0x42,   0x43,   0x44,   0x45,   0x46,   0x47,
 0x48,   0x49,   0x4A,   0x4B,   0x4C,   0x4D,   0x4E,   0x4F,
 0x50,   0x51,   0x52,   0x53,   0x54,   0x55,   0x56,   0x57,
 0x58,   0x59,   0x5A,   0x5B,   0x5C,   0x5D,   0x5E,   0x5F,
 0x60,   0x61,   0x62,   0x63,   0x64,   0x65,   0x66,   0x67,
 0x68,   0x69,   0x6A,   0x6B,   0x6C,   0x6D,   0x6E,   0x6F,
 0x70,   0x71,   0x72,   0x73,   0x74,   0x75,   0x76,   0x77,
 0x78,   0x79,   0x7A,   0x7B,   0x7C,   0x7D,   0x7E,   0x7F,
 0xC7,   0xFC,   0xE9,   0xE2,   0xE4,   0xE0,   0xE5,   0xE7,
 0xEA,   0xEB,   0xE8,   0xEF,   0xEE,   0xEC,   0xC4,   0xC5,
 0xC9,   0xE6,   0xC6,   0xF4,   0xF6,   0xF2,   0xFB,   0xF9,
 0xFF,   0xD6,   0xDC,   0xA2,   0xA3,   0xA5,  0x20A7, 0x192,
 0xE1,   0xED,   0xF3,   0xFA,   0xF1,   0xD1,   0xAA,   0xBA,
 0xBF,  0x2310,  0xAC,   0xBD,   0xBC,   0xA1,   0xAB,   0xBB,
0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
0x3B1,  0x3B2,  0x393,  0x3C0,  0x3A3,  0x3C3,  0x3BC,  0x3C4,
0x3A6,  0x398,  0x3A9,  0x3B4,  0x221E, 0x3C6,  0x3B5,  0x2229,
0x2261,  0xB1,  0x2265, 0x2264, 0x2320, 0x2321,  0xF7,  0x2248,
 0xB0,  0x2219,  0xB7,  0x221A, 0x207F,  0xB2,  0x25A0,  0xA0
]);

var buf = new Buffer([0xE6, 0x54, 0x6F, 0x72, 0x72, 0x65, 0x6E, 0x74, 0x73]);
console.log( buf.toString('cp437', 0, 8) ); // 'μTorrent'

If done properly there won't be any performance difference.

If done in JavaScript, it would run slower than node_buffer.cc code (C/C++ code runs ≈5x faster than JavaScript).

If done as a C/C++ addon, it would face the general unavailability of build tools on end-user Windows systems.

There's no good way to solve this problem; there's only a (not obvious) choice of the lesser of three evils:

  • If you add this feature to the core, the core will be bloated.
  • If you make a JavaScript userland module, it will lose 80% of the possible speed.
  • If you make a C/C++ addon in a userland module, then a simple npm install is no longer possible on a typical Windows machine; it has to meet node-gyp's requirements beforehand (and, for example, the Windows 7 64-bit SDK .ISO file is 570.9 MB).

The proposed solution tries to "split the baby": put the encoder and decoder in the core, but let the encoding definitions come from userland. The userland code can then remain in JavaScript (definitions are plain data structures, such as the array in the example above, so the 80% speed drop doesn't affect them), and the addition to the core is minimal.
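The split can be sketched entirely in userland JavaScript (Buffer.defineEncoding is hypothetical, not a real API; under the proposal the per-byte loop below would live in core C++, and only the 256-entry table would come from userland as data). The tiny table here passes ASCII through and maps 0xE6 to 'μ' as in the CP437 table above, with U+FFFD for everything else:

```javascript
// Userland sketch of the proposed core/userland split; all names hypothetical.
const encodingTables = new Map();

function defineEncoding(name, table) {
  if (table.length !== 256) throw new Error('table must map all 256 byte values');
  if (encodingTables.has(name)) throw new Error('duplicate encoding: ' + name);
  encodingTables.set(name, Uint16Array.from(table)); // code points fit in 16 bits
}

// This loop is what the proposal would move into core C++.
function decodeWith(buf, name, start = 0, end = buf.length) {
  const table = encodingTables.get(name);
  let out = '';
  for (let i = start; i < end; i++) out += String.fromCharCode(table[buf[i]]);
  return out;
}

// Illustrative fragment, not the full CP437 table from the comment above.
const table = Array.from({ length: 256 },
  (_, i) => (i < 0x80 ? i : i === 0xE6 ? 0x3BC : 0xFFFD));
defineEncoding('cp437-fragment', table);

const buf = Buffer.from([0xE6, 0x54, 0x6F, 0x72, 0x72, 0x65, 0x6E, 0x74, 0x73]);
console.log(decodeWith(buf, 'cp437-fragment', 0, 8)); // 'μTorrent'
```

Throwing on a duplicate name, as in the original proposal, also gives a defined failure mode when two modules try to register the same encoding.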

@rlidwka

rlidwka commented Oct 20, 2013

That sounds like an extremely bad idea, especially if two or more modules decide to implement different versions of the same encoding under the same name.
