Possible newlineFailsafe feature? #51

Open
cmangani opened this issue Oct 1, 2018 · 0 comments

cmangani commented Oct 1, 2018

Hi,

Great library. Just wanted to let you know about an issue I ran into. SFDC has a non-compliant CSV export (from its bulk export tool) that produces rows like the "00T1p00002Pq1pREAR" row below. This causes the parser to stay in the _isQuoted state as it flows through potentially hundreds of records, until it encounters another quote that ends the state. For instance, if you save the sample below as a CSV file, the 4 rows after 00T1p00002Pq1pREAR don't get parsed.

Our workaround was a local version of csv-streamify with the change below, as a sort of "escape hatch". Using the existing csv-streamify, out of 82 million SFDC "task" records we only processed 76 million. With my code change, we processed 81,994,204 records thanks to the "recovery" code.

I know this doesn't work in all cases (e.g. a String field that just happens to match my regex, which is very improbable in our use case). And perhaps you have a real fix for the issue, but I figured I'd drop a line.

Thanks again for the great work!

My Changed Code in csv-streamify

/*
...
 * @param {boolean} [opts.columns=false] Whether to parse headers
 * @param {object} [opts.newlineRegexFailsafe] If the parser gets tripped up, detect the start of a new line with a regex. Keys: maxReadAheadLength and regex
 * @param {function} [cb] Callback function
 */

...

// newline
if (!state._isQuoted && (c === opts.newline || c === opts.newline[0])) {
  state._newlineDetected = true
  queue(c)
  continue
}

if (opts.newlineRegexFailsafe && state._isQuoted && (c === opts.newline || c === opts.newline[0])) {
  // find next delimiter.. then apply regex to see if we are at a newline
  var buff = [];
  var z = 1;
  var nxtChar;
  while (
    (!opts.newlineRegexFailsafe.maxReadAheadLength || z < opts.newlineRegexFailsafe.maxReadAheadLength) // restrict how many characters to read ahead.. performance optimization
    && i + z < data.length // don't read past end of input
    && (nxtChar = data.charAt(i + z)) != opts.delimiter) { // read up until the next encounter of delimiter
    buff.push(nxtChar);
    z++;
  }
  var succeedingCharacters = buff.join("");
  if (succeedingCharacters.match(opts.newlineRegexFailsafe.regex)) {
    emitLine(this)
    continue
  }
}
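
If anyone wants to try the read-ahead check in isolation, here is a minimal standalone sketch of the same idea. The function name `looksLikeRecordStart` and the sample string are made up for illustration and are not part of csv-streamify; the loop bounds are meant to mirror the patch above, and the regex is a shortened version of the one I use in my calling app.

```javascript
// Hypothetical helper (not part of csv-streamify): decides whether the
// newline at index i looks like a real record boundary by reading ahead
// to the next delimiter and testing that text against the failsafe regex.
function looksLikeRecordStart(data, i, failsafe, delimiter) {
  const max = failsafe.maxReadAheadLength;
  let ahead = '';
  // Same bounds as the patch: the offset z starts at 1 and the loop stops at
  // maxReadAheadLength, at the end of input, or at the next delimiter.
  for (let z = 1; (!max || z < max) && i + z < data.length; z++) {
    const next = data.charAt(i + z);
    if (next === delimiter) break;
    ahead += next;
  }
  return new RegExp(failsafe.regex).test(ahead);
}

// Assumed sample input: one newline followed by an 18-character Id,
// and one newline followed by ordinary text.
const failsafe = { regex: '^(00T|001|003)[a-zA-Z0-9]{15}$', maxReadAheadLength: 20 };
const data = 'stuck in quotes\n00T1p00002Pq1pSEAR,next field\nplain text, more';
const firstNewline = data.indexOf('\n');
const secondNewline = data.indexOf('\n', firstNewline + 1);

console.log(looksLikeRecordStart(data, firstNewline, failsafe, ','));  // true
console.log(looksLikeRecordStart(data, secondNewline, failsafe, ',')); // false
```

Note that the check only sees the current chunk, so an Id that straddles a chunk boundary will not match.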

My Calling App

const csv = require('csv-streamify');
const input = 'subset.csv';
const fs = require('fs');
// Failsafe for malformed, multi-quoted fields: if the parser is in the _isQuoted state and reaches a newline
// followed by 18 characters whose first 3 match a known SFDC Id prefix (e.g. 00T),
// we treat it as the start of a new line. The previous record is emitted incomplete (i.e. it will
// not contain all of its columns, since the rest were swallowed by the multi-quoted column). It is left to the caller
// to check for the correct number of columns and to discard or otherwise deal with the errant row.
const parser = csv({"newlineRegexFailsafe" : {"regex" : "^(00T|001|003|005|a21|801|006|00U|a25)[a-zA-Z0-9]{15}$", "maxReadAheadLength" : 20}});
//const parser = csv() // test with this one.. you will see it fails on the 00T1p00002Pq1pREAR record

// emits each line as a buffer or as a string representing an array of fields
var idx = 0;
parser.on('data', function (values) {
  console.log(idx + ":" + values[0] + ":" + values.length);
  // NOTE: this is a "Task", 79 columns.
  // if values.length === 79, probably a good record (perhaps add a simple heuristic to verify a few of the expected column contents)
  // if values.length < 79, the record was a victim of the double-double-quote issue; probably easier to log and discard it than to try to recover it
  // if values.length > 79, the record was a rare/unfortunate victim: the parser's chunk boundary fell inside the maxReadAheadLength
  // string, so the record ran on into the next one. NOTE: at most we will lose 2 records to the double-double-quote issue, as the code will
  // read in the next chunk and continue through the next record (making values.length > 79) in the _isQuoted state
  // until it again encounters the newlineRegexFailsafe regex on the subsequent record.
  idx++;
});

//001, 003, 00U, 006, 801, a21, a25, 00T, 00500T1p00002Pq1pREAR
fs.createReadStream(input, {start: 0}).pipe(parser);
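
The column-count check described in the comments above could look roughly like this. It is only a sketch: `classifyRow`, `tally` and `EXPECTED_COLUMNS` are names I made up for illustration, and 79 is simply the column count of our "Task" export.

```javascript
// Hypothetical helpers, not part of csv-streamify.
// 79 is the column count of the SFDC "Task" export discussed here.
const EXPECTED_COLUMNS = 79;

// Classify a parsed row by its column count:
//   'ok'        -> exactly the expected number of columns
//   'truncated' -> fewer columns (record was cut short by the failsafe)
//   'overrun'   -> more columns (record ran on into the next one)
function classifyRow(values, expected = EXPECTED_COLUMNS) {
  if (values.length === expected) return 'ok';
  return values.length < expected ? 'truncated' : 'overrun';
}

// Running totals so bad rows can be logged and discarded by the caller.
const counts = { ok: 0, truncated: 0, overrun: 0 };
function tally(values) {
  counts[classifyRow(values)]++;
}

// Example with synthetic rows of 79, 12 and 150 empty fields.
tally(new Array(79).fill(''));
tally(new Array(12).fill(''));
tally(new Array(150).fill(''));
console.log(counts); // { ok: 1, truncated: 1, overrun: 1 }
```

In the calling app, `tally(values)` would go inside the `parser.on('data', ...)` handler, assuming `values` arrives there as an array of fields.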

Partial CSV Export from SFDC

00T1p00002Pq1pOEAR,0032400000pPGSuAAO,0012400000poV9uAAE,E-mailed - Anord,2017-03-01,Completed,Normal,false,005U0000000OflxIAC,,Attempted Contact,false,0012400000poV9uAAE,true,2017-03-01T14:55:03.000+0000,005U0000000OflxIAC,2017-03-01T14:55:03.000+0000,005U0000000OflxIAC,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.348341659,,Non-Target,,,,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.348341659,,,,,
00T1p00002Pq1pREAR,0032400000pPGSuAAO,0012400000poV9uAAE,Roofbuilders,2018-03-07,Completed,Normal,false,0051p000008bcDxAAI,"""",Call,false,0012400000poV9uAAE,true,2018-03-07T17:50:08.000+0000,0051p000008bcDxAAI,2018-03-08T09:39:04.000+0000,0051p000008bcDxAAI,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Call,false,,,,,R.412305459,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Call,,true,2018-03-08T09:39:04.000+0000,0.0,,,,,,,,,,,,,,,,,R.412305459,,,,,
00T1p00002Pq1pSEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Left Message & E-mailed - Roofbuilders,2018-03-08,Completed,Normal,false,0051p000008bcDxAAI,#NIS,Attempted Contact,false,0012400000poV9uAAE,true,2018-03-08T09:39:03.000+0000,0051p000008bcDxAAI,2018-03-08T09:39:03.000+0000,0051p000008bcDxAAI,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.412377485,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.412377485,,,,,
00T1p00002Pq1pWEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Left Message - Attempted Contact,2017-07-28,Completed,Normal,false,005U0000004YtsvIAC,,Attempted Contact,false,0012400000poV9uAAE,true,2017-07-28T16:33:47.000+0000,005U0000004YtsvIAC,2017-07-28T16:33:47.000+0000,005U0000004YtsvIAC,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Attempted Contact,false,,,,,R.374017937,,Non-Target,"ONS-Fairfax, VA - P&I-00422",00eU0000000eAbrIAE,Aerotek Chesapeake,Attempted Contact,,false,,0.0,,,,,,,,,,,,,,,,,R.374017937,,,,,
00T1p00002Pq1pXEAR,0032400000pPGSuAAO,0012400000poV9uAAE,Candidate Summary/G2 Edited by Joseph Henry Breithaupt,2017-07-05,Completed,Normal,false,00524000003MqTYAA0,"really nice guy, jumpy between contracts, worked through people solutions from 2012-14, ennis flint left because he got his bachelors degree and got offered the position at alloy polymers, he is interested in getting more hands on with PLC work or a higher paying maintenance engineer role, he is working a split shift at alloy which he hates (comes in for the morning, leaves and comes back for the evening) he is currently making 28/hr but would be interested in 60k and up because he does get overtime, sending him job descriptions for foley and sabra",G2,false,0012400000poV9uAAE,true,2017-07-05T09:04:25.000+0000,00524000003MqTYAA0,2017-07-05T09:04:25.000+0000,00524000003MqTYAA0,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,G2,false,,,,,R.369709668,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,G2,,false,,0.0,,,,,,,,,,,,,,,,,R.369709668,,,,,
00T1p00002Pq1pYEAR,0032400000pPGSuAAO,0012400000poV9uAAE,TT,2017-07-06,Completed,Normal,false,00524000003MqTYAA0,"not the right experience for either sabra or foley, he is interested in staying in touch for other roles moving forward, sharp guy",Call,false,0012400000poV9uAAE,true,2017-07-06T11:07:09.000+0000,00524000003MqTYAA0,2017-07-06T11:07:09.000+0000,00524000003MqTYAA0,2018-03-14T16:47:33.000+0000,false,false,,,,,,false,,false,,,,,,,,,,,Task,Call,false,,,,,R.370010467,,Non-Target,"ONS-Richmond, VA-00494",00eU0000000eAbrIAE,Aerotek Chesapeake,Call,,true,2017-07-06T00:00:00.000+0000,0.0,,,,,,,,,,,,,,,,,R.370010467,,,,,
