-f docx+styles -t dokuwiki results in raw html formatted lists #8920

Closed
barredespace opened this issue Jun 27, 2023 · 6 comments

@barredespace

Pandoc 3.1.3 / OSX 12.6.5 / HomeBrew version

This is the command line that triggers the bug:
pandoc -f docx+styles -t dokuwiki juste_listes.docx -o juste_liste.txt

juste_liste.docx is a document with just one unordered list in it.

If I run the same command without +styles (pandoc -f docx -t dokuwiki juste_listes.docx -o juste_liste.txt), I get this result:

  * Liste 1
  * liste 2
  * liste 3
    * liste 3a
    * liste 3b
    * liste 3c
  * liste 4

If I add +styles, here is what I get:

<HTML><ul></HTML>
<HTML><li></HTML><HTML><p></HTML>Liste 1<HTML></p></HTML>
<HTML></li></HTML>
<HTML><li></HTML><HTML><p></HTML>liste 2<HTML></p></HTML>
<HTML></li></HTML>
<HTML><li></HTML><HTML><p></HTML>liste 3<HTML></p></HTML>

<HTML><ul></HTML>
<HTML><li></HTML><HTML><p></HTML>liste 3a<HTML></p></HTML>
<HTML></li></HTML>
<HTML><li></HTML><HTML><p></HTML>liste 3b<HTML></p></HTML>
<HTML></li></HTML>
<HTML><li></HTML><HTML><p></HTML>liste 3c<HTML></p></HTML>
<HTML></li></HTML><HTML></ul></HTML>
<HTML></li></HTML>
<HTML><li></HTML><HTML><p></HTML>liste 4<HTML></p></HTML>
<HTML></li></HTML><HTML></ul></HTML>

The first command, without +styles, gives me syntactically correct dokuwiki output.

I need the +styles extension to retain custom styles, convert them with a Lua filter, and later parse them with a dokuwiki plugin.

@jgm
Owner

jgm commented Jun 27, 2023

Can you show the output of

pandoc -f docx+styles juste_liste.docx -t native

and

pandoc -f docx juste_liste.docx -t native

respectively? There must be a change in the AST that will explain why this is happening.
Generally pandoc will fall back to raw HTML when a list contains a feature that is too complex to represent using regular dokuwiki syntax.

@barredespace
Author

Here they are:

pandoc -f docx+styles juste_liste.docx -t native

[ BulletList
    [ [ Div
          ( "" , [] , [ ( "custom-style" , "List Paragraph" ) ] )
          [ Para [ Str "Liste" , Space , Str "1" ] ]
      ]
    , [ Div
          ( "" , [] , [ ( "custom-style" , "List Paragraph" ) ] )
          [ Para [ Str "Liste" , Space , Str "2" ] ]
      ]
    , [ Div
          ( "" , [] , [ ( "custom-style" , "List Paragraph" ) ] )
          [ Para [ Str "Liste" , Space , Str "3" ] ]
      , BulletList
          [ [ Div
                ( "" , [] , [ ( "custom-style" , "List Paragraph" ) ] )
                [ Para [ Str "Liste" , Space , Str "3a" ] ]
            ]
          , [ Div
                ( "" , [] , [ ( "custom-style" , "List Paragraph" ) ] )
                [ Para [ Str "Liste" , Space , Str "3b" ] ]
            ]
          , [ Div
                ( "" , [] , [ ( "custom-style" , "List Paragraph" ) ] )
                [ Para [ Str "Liste" , Space , Str "3c" ] ]
            ]
          ]
      ]
    , [ Div
          ( "" , [] , [ ( "custom-style" , "List Paragraph" ) ] )
          [ Para [ Str "Liste" , Space , Str "4" ] ]
      ]
    ]
]

and

pandoc -f docx juste_liste.docx -t native

[ BulletList
    [ [ Para [ Str "Liste" , Space , Str "1" ] ]
    , [ Para [ Str "Liste" , Space , Str "2" ] ]
    , [ Para [ Str "Liste" , Space , Str "3" ]
      , BulletList
          [ [ Para [ Str "Liste" , Space , Str "3a" ] ]
          , [ Para [ Str "Liste" , Space , Str "3b" ] ]
          , [ Para [ Str "Liste" , Space , Str "3c" ] ]
          ]
      ]
    , [ Para [ Str "Liste" , Space , Str "4" ] ]
    ]
]

@jgm
Owner

jgm commented Jun 28, 2023

OK, it's the Div that is inserted for custom-styles that is blocking the regular list.
However, the div isn't being represented in HTML anyway, so we can probably improve this.
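
For pandoc versions that still show this behaviour, one possible workaround is to strip the wrapper Divs from list items before the dokuwiki writer sees them. The following is a minimal sketch that works on pandoc's JSON AST (as produced by -t json), not pandoc's own fix. The function name unwrap_styled_divs is hypothetical, and the sketch discards the custom style on list paragraphs, so it only fits cases where that particular style is not needed downstream.

```python
def unwrap_styled_divs(blocks, in_list=False):
    """Return a copy of `blocks` in which custom-style Divs found inside
    list items are replaced by their contents.

    Assumes pandoc's JSON AST layout: each block is {"t": type, "c": contents},
    and a Div's contents are [[id, classes, key_value_pairs], inner_blocks].
    Only Div, BulletList and OrderedList are descended into; other containers
    (block quotes, tables, ...) are left untouched in this sketch.
    """
    out = []
    for block in blocks:
        kind = block.get("t")
        if kind == "Div":
            attr, inner = block["c"]
            inner = unwrap_styled_divs(inner, in_list)
            has_style = any(key == "custom-style" for key, _ in attr[2])
            if in_list and has_style:
                out.extend(inner)  # drop the wrapper, keep its contents
            else:
                out.append({"t": "Div", "c": [attr, inner]})
        elif kind == "BulletList":
            items = [unwrap_styled_divs(item, True) for item in block["c"]]
            out.append({"t": kind, "c": items})
        elif kind == "OrderedList":
            list_attr, raw_items = block["c"]
            items = [unwrap_styled_divs(item, True) for item in raw_items]
            out.append({"t": kind, "c": [list_attr, items]})
        else:
            out.append(block)
    return out
```

To use this as a JSON filter, one would load the document from stdin with json.load, apply the function to the document's "blocks" entry, and write the result to stdout with json.dump. Custom-style Divs outside lists are kept, so a later filter can still act on them.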

jgm closed this as completed in c908867 Jun 28, 2023
@StefanP74

StefanP74 commented Jul 31, 2024

Pandoc 3.3 / Win10 22H2

I have a problem converting lists from docx to dokuwiki, like the one above.

pandoc document.docx -f docx -t dokuwiki -o output.txt --extract-media ./

and the result is:

<HTML><ul></HTML>
<HTML><li></HTML><HTML><p></HTML>blablablablablablabla<HTML></p></HTML>
<HTML><ul></HTML>
<HTML><li></HTML><HTML><p></HTML>blablablablablablablablabla<HTML></p></HTML><HTML></li></HTML>
.......

but it should be:

  * blablablabla
  * blablablabla
  * blablablabla

Everything else was converted to dokuwiki syntax correctly.
Did I do something wrong?
I also tried it with the +styles extension, with the same result.

@jgm
Owner

jgm commented Jul 31, 2024

The dokuwiki writer uses HTML tags when the list is not "simple" -- that is, when the elements of the list are not Plain blocks or sublists. Here it seems you have Para blocks instead of Plain.

This behavior preserves the distinction between tight and loose lists, which it seems dokuwiki has no other way of representing.

I think, though, that it would be better to modify the dokuwiki writer so that it treats lists like this as if they had Plain and not Para -- we'd lose that distinction, but the output would be more natural.

[Edit: looking at the code, we should be handling Para like Plain here. Can you upload a docx I can test with?]
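
The "simple list" condition described in this comment can be sketched as a small predicate over pandoc's JSON AST (from -t json). This is an illustration of the idea only, not the writer's actual Haskell code. The name is_simple_list is hypothetical, and treating Para like Plain follows the edit note above rather than any verified behaviour.

```python
LIST_TYPES = {"BulletList", "OrderedList"}
SIMPLE_LEAVES = {"Plain", "Para"}  # assumption: Para is treated like Plain


def list_items(block):
    """Return the items (each a list of blocks) of a list block."""
    if block["t"] == "BulletList":
        return block["c"]
    if block["t"] == "OrderedList":
        return block["c"][1]  # contents are [list attributes, items]
    raise ValueError("not a list block: " + str(block["t"]))


def is_simple_list(block):
    """True if every item holds only Plain/Para blocks or simple sublists."""
    for item in list_items(block):
        for child in item:
            if child["t"] in SIMPLE_LEAVES:
                continue
            if child["t"] in LIST_TYPES and is_simple_list(child):
                continue
            return False  # e.g. a Div, a code block, a figure, ...
    return True
```

Applied to the two native dumps earlier in the thread, a check like this would accept the list produced without +styles and reject the one whose items are wrapped in Div blocks.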

@StefanP74

Hello, thanks for this information.

Because of it, I found the problem: there's a JPG inside the list.
I removed the JPG and the output is perfect.
A test file with a JPG inside the list is attached.
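
For larger documents, finding the offending list item by hand can be tedious. The sketch below shows one way to report which items of a list contain an image, again working on pandoc's JSON AST (-t json). The helper names are hypothetical. Checking for both Image and Figure nodes is an assumption about how the docx reader may represent a picture, and the sketch does not confirm that images are what triggers the HTML fallback.

```python
IMAGE_TYPES = {"Image", "Figure"}  # assumption: either may represent a picture


def contains_type(node, wanted):
    """True if any AST node under `node` has a type tag in `wanted`."""
    if isinstance(node, dict):
        if node.get("t") in wanted:
            return True
        return any(contains_type(value, wanted) for value in node.values())
    if isinstance(node, list):
        return any(contains_type(child, wanted) for child in node)
    return False  # strings, numbers, None


def list_items_with_images(list_block):
    """Return the zero-based indices of items that contain an image."""
    if list_block["t"] == "BulletList":
        items = list_block["c"]
    else:  # OrderedList: contents are [list attributes, items]
        items = list_block["c"][1]
    return [i for i, item in enumerate(items)
            if contains_type(item, IMAGE_TYPES)]
```

An image inside a nested sublist is attributed to the top-level item that contains the sublist, which is usually enough to locate it in the source document.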

List_Test.docx
