Parse2 update (#99) #109

Merged
merged 18 commits into from
Apr 11, 2022

agalitsyna (Member) commented Dec 8, 2021

parse2 is a module specifically designed for parsing long Hi-C reads.
This is an improved version of parse2 with resolved comments from the previous PR: #96

Major changes:

  • Separation of parse and parse2 modules. Parse has an option --walks-policy all, which parses long walks but always reports pair orientation and the outer positions of 5'-ends, as if each pair had been read in paired-end mode independently. Parse2 is specifically designed for long walks and has the options --report-position and --report-orientation, which can be used to report junctions, reads, or walks.

  • Parse2 has an option to parse single-end reads, --single-end option, tested on minimap2 output for MC-3C.

  • Parse2 has max_fragment_size instead of parse's max_molecule_size, which helps to determine the overlapping ends of forward and reverse reads.

  • Recent update simplifies the code:

    • single _parse library used by both parse and parse2,
    • a number of functions that reduce repetitive code, e.g. push_pair function,
    • docstrings and documented structure of _parse library.
  • Both parse and parse2 have options to report 5' or 3' ends and to flip alignments according to chromosome coordinate.

  • Both parse and parse2 have the pysam backend

  • Important improvements of the tests for parse and parse2
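To make the "flip alignments according to chromosome coordinate" option above concrete, here is a minimal sketch of what flipping a pair means: the two sides are reordered so that the side with the lower chromosome rank (and then the lower position) comes first. The names `Side` and `flip_pair`, and the rule that unlisted chromosomes sort last, are assumptions made for illustration; they are not taken from the `_parse` library.

```python
from typing import NamedTuple, Sequence, Tuple


class Side(NamedTuple):
    """One side of a pair (hypothetical container, for illustration only)."""
    chrom: str
    pos: int
    strand: str


def flip_pair(side1: Side, side2: Side, chrom_order: Sequence[str]) -> Tuple[Side, Side]:
    """Return the two sides ordered by (chromosome rank, position).

    Assumption: chromosomes missing from `chrom_order` are ranked after
    all listed ones; the real tool may treat them differently.
    """
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    unknown = len(rank)
    key1 = (rank.get(side1.chrom, unknown), side1.pos)
    key2 = (rank.get(side2.chrom, unknown), side2.pos)
    # Keep the original order on ties; swap only when side2 sorts strictly first.
    return (side1, side2) if key1 <= key2 else (side2, side1)


a = Side("chr2", 100, "+")
b = Side("chr1", 5000, "-")
first, second = flip_pair(a, b, ["chr1", "chr2"])
print(first.chrom, second.chrom)
```

Each side keeps its own strand when swapped, so the pair orientation is preserved in the sense that it travels with the side it belongs to.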

1. parse is now fully pysam-powered, which is propagated to dependencies and
setup.

2. Novel classes for simplified access to alignments with pysam:
AlignmentFilePairtoolized, AlignedSegmentPairtoolized in _parse_pysam.pyx.

3. Tests updated; test sam files can be parsed by pysam.
* Parse2: created. Improved version of parse2 with resolved comments from the previous PR: #96

Major changes:

* Single-end mode of parse2 added, --single-end option. Tested on minimap2 output for MC-3C.

* parse2 now has three possible coordinate systems for reporting: read, walk and pair (described in the docstring). Default coord system "read" tested.

* demo notebook with MC-3C and Arima datasets

* simplified code of parse2, e.g. push_pair function added instead of repetitive code
improved docstrings

* Max molecule size replaced with max fragment size.  

* parse2(docs): Documentation improved, #96 (comment) resolved.

* Option to report 5' or 3' ends added.
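As an illustration of what reporting the 5' or the 3' end of an alignment means, the sketch below picks the coordinate from an aligned interval and its strand. It assumes 1-based inclusive coordinates with `start <= end`; the function name `alignment_end` and its parameters are hypothetical and are not the names used in parse or parse2.

```python
def alignment_end(start: int, end: int, strand: str, which: str = "5") -> int:
    """Return the 5' or 3' coordinate of an aligned segment.

    Assumptions (illustrative only): `start <= end` are reference
    coordinates, `strand` is "+" or "-", `which` is "5" or "3".
    """
    if strand not in ("+", "-"):
        raise ValueError(f"unknown strand: {strand!r}")
    if which not in ("5", "3"):
        raise ValueError(f"unknown end: {which!r}")
    # On the forward strand the read starts at the leftmost base;
    # on the reverse strand it starts at the rightmost base.
    five_prime, three_prime = (start, end) if strand == "+" else (end, start)
    return five_prime if which == "5" else three_prime


print(alignment_end(100, 150, "+", "5"))
print(alignment_end(100, 150, "-", "5"))
```

For a forward-strand alignment spanning 100-150 the 5' end is 100; for the same interval on the reverse strand it is 150, and the 3' end is 100.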

@agalitsyna agalitsyna mentioned this pull request Apr 6, 2022
31 tasks
@agalitsyna agalitsyna changed the base branch from master to pre0.4.0 April 11, 2022 22:11
@agalitsyna agalitsyna merged commit db465f6 into pre0.4.0 Apr 11, 2022
agalitsyna added a commit that referenced this pull request Jun 1, 2022
* Separate cli and lib

* pairtools flip fix for unannotated chromosomes, resolving #91

* handle empty chromosomes, resolved
#76

* fixed rfrags indexing and first rfrag omission, resolved
#73

* resolved or deprecated suggestions in #16

* merge improvements, header merge fixed

- resolved merge without arguments: #61

- option to add only the first header in merge, resolved
#18

* in merge, added option to concatenate instead of merge sorted inputs,
resolving: #23

* merge now checks that columns of inputs are the same

* I/O improvements

- auto_open defaults to stdin/stdout when path evaluates to False.
resolved #48

- auto_open defaults to stdin/stdout when the path is "-"

- if the stream is optional, it's controlled by the module itself

* Parse2 update (#99) (#109)

Improved version of parse2 with resolved comments from the previous PR: #96

- Separation of parse and parse2 modules. Parse has an option --walks-policy all, which parses long walks but always reports pair orientation and the outer positions of 5'-ends, as if each pair had been read in paired-end mode independently. Parse2 is specifically designed for long walks and has the options --report-position and --report-orientation, which can be used to report junctions, reads, or walks.

- Parse2 has an option to parse single-end reads, --single-end option, tested on minimap2 output for MC-3C.

- Parse2 has max_fragment_size instead of parse's max_molecule_size, which helps to determine the overlapping ends of forward and reverse reads.

- Recent update simplifies the code: single _parse library used by both parse and parse2,

- a number of functions that reduce repetitive code, e.g. push_pair function,

- docstrings and documented structure of _parse library.

- Both parse and parse2 have options to report 5' or 3' ends and to flip alignments according to chromosome coordinate.

- Both parse and parse2 have the pysam backend

- Improvements of the tests for parse and parse2

- Documentation includes description of various --report-orientation and --report-position cases.

* Merge pairlib into pairtools.lib.

* CLI for scalings added.

* stats output in yaml format

* Header CLI (#121)

- new module called by `pairtools header`
- submodules: 
  - generate : Generate the header
  - set-columns : Add the columns to the .pairs/pairsam file
  - transfer : Transfer the header from one pairs file to another
  - validate-columns : Validate the columns of the .pairs/pairsam file
- resolves #119 
- option remove-columns for `pairtools select`: Remove the columns from .pairs/pairsam file

* pairtools phase critical update (#114)

* important fixes: - cython dedup with no-parent id forgotten counter reset; - sphinx doc update (added pysam); - header warning if empty and error when trying to add a field to an empty one

* Add summaries (#105)

* Add functions for duplication tile and complexity

* Make dedup stats!

* Benchmarks finalization

* [WIP] Stats split by filters (#132)

* Markasdup lib removed; markasdup CLI explanation improved

* dedup filter stats added and tested

Co-authored-by: Aleksandra Galitsyna <agalitzina@gmail.com>
Co-authored-by: Ilya Flyamer <flyamer@gmail.com>
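The I/O behaviour listed in the commit message above (fall back to stdin/stdout when the path evaluates to False or is "-") can be sketched as follows. This is only an illustration of the stated rule: `open_or_std` is a hypothetical name, and the real `auto_open` signature and any extra features it has are not shown in this PR.

```python
import sys


def open_or_std(path, mode="r"):
    """Hypothetical helper mirroring the rule described above.

    Returns sys.stdin (mode "r") or sys.stdout (mode "w") when `path`
    is falsy (None, "") or equal to "-"; otherwise opens the file.
    """
    if mode not in ("r", "w"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if not path or path == "-":
        return sys.stdin if mode == "r" else sys.stdout
    return open(path, mode)


# Writing to "-" goes to standard output.
out = open_or_std("-", "w")
out.write("#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2\n")
```

A caller that receives a standard stream this way should not close it, which is one reason to keep the decision inside a single helper.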