
Incremental or streaming decoding #10

Open
gwils opened this issue Mar 7, 2018 · 5 comments

Comments

@gwils
Member

gwils commented Mar 7, 2018

Currently sv will parse and load an entire document into memory before starting any decoding. On a 5GB CSV file, this would likely end in disaster.

It would be worth looking into whether we could add a "row-at-a-time" approach and what the trade-offs would be.

@axman6

axman6 commented Mar 7, 2018

Row at a time or a chunk of rows at a time would be good. Streaming individual rows is going to be inefficient in many cases (such as the time-double value examples I've shown before), so having something like:

stream :: Monad m => Int -> Parser a -> ByteString m a -> Stream (Of (Vector a)) m (Either (Message, ByteString m r) r)

(see https://hackage.haskell.org/package/streaming-utils-0.1.4.7/docs/Data-Attoparsec-ByteString-Streaming.html#v:parsed)

would be quite useful.
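To make the chunking half of that signature concrete, here is a small sketch using the `streaming` and `vector` packages. It batches an already-decoded stream of rows into `Vector`s of a given size; the parsing step itself is left out, and nothing here is sv's actual API:

```haskell
-- Sketch: batch a stream of decoded rows into Vectors of size n, so
-- downstream consumers pay per-chunk rather than per-row overhead.
-- `chunksOf` splits the stream into substreams of n elements;
-- `mapped S.toList` collapses each substream into a list.
import Streaming (Stream, Of, chunksOf, mapped)
import qualified Streaming.Prelude as S
import qualified Data.Vector as V

chunkRows :: Monad m => Int -> Stream (Of a) m r -> Stream (Of (V.Vector a)) m r
chunkRows n = S.map V.fromList . mapped S.toList . chunksOf n
```

A full `stream` as proposed above would compose something like this with a per-row (or per-chunk) attoparsec run, in the style of the `parsed` function linked above.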

Also important is streaming serialisation.

@LeanderK

LeanderK commented Mar 7, 2018

Hello, I am currently working on something comparable to a data-frame library and just stumbled upon this package. Looks great! 🙂

I would love to use this package for parsing CSVs etc., but my library is fundamentally streaming-based, so this feature is important to me. I would also like a more low-level hook, since I am not sure yet which streaming package I want to integrate with.

@tonyday567

A quick experiment: https://github.com/tonyday567/streaming-sv/

I got a fair way towards streaming with the existing library. The main blocker seemed to be the list in Records.

@gwils
Member Author

gwils commented Mar 10, 2018

Hi Tony. That's quite interesting. Thanks for linking it.

The main blocker seemed to be the list in Records.

Do you mean the vector?

data Records s =
    EmptyRecords
  | Records (Record s) (Vector (Newline, Record s))

Perhaps we could change that structure to better support streaming, or create a separate, more stream-oriented structure as an alternative?

@tonyday567

Yes, I meant the Vector in Records. A streaming version would be something like:

newtype RecordsS m s = RecordsS (Stream (Of (Newline, Record s)) m ())

Not sure what to do about the m. You might be able to swallow it with an existential.

I had to hardcode an Identity as in SvParser (B.ByteString Identity ()) in the example but it's going to come out of a file as a B.ByteString IO () (say), so there may need to be another type parameter anyway, and that would propagate up.
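For illustration, here is a self-contained sketch of both options (the extra type parameter, and the existential). The `Newline` and `Record` definitions below are stand-ins so the sketch compiles on its own; they are not sv's actual declarations:

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Streaming (Stream, Of)

-- Stand-ins for sv's types, just to make the sketch self-contained.
data Newline = CR | LF | CRLF
newtype Record s = Record [s]

-- Option 1: expose the monad as an extra type parameter. The direct
-- translation, but the `m` propagates up through SvParser and friends.
newtype RecordsS m s = RecordsS (Stream (Of (Newline, Record s)) m ())

-- Option 2: swallow the monad with an existential. Callers can still
-- consume the stream generically, but can no longer demand a concrete
-- monad (e.g. IO) at the use site.
data RecordsE s =
  forall m. Monad m => RecordsE (Stream (Of (Newline, Record s)) m ())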

But it's impressive that streaming can be achieved out of the box without any prior engineering. It shows you're on the right track with these types.
