Fixing PDF prose spacing in microsoft/markitdown

Some prose-heavy PDFs were being misclassified as tables, collapsing normal spacing into comma-like cell output. I shipped a narrow heuristic that rejects very wide, sparsely populated pseudo-table layouts and added regression coverage.

OPEN · microsoft/markitdown · PR #1847 · 2026-04-29
  • Issue #120 reported PDFs where normal prose was emitted like a fake table.
  • The converter promoted rows to table rows whenever they aligned with at least two detected columns.
  • Wide multi-column prose created many tentative global columns even though each row only populated a small fraction of them.
  • Tracked how many detected global columns each row actually uses.
  • Added a guard: if a layout has more than 10 columns and the median table-row fill ratio is below 0.4, fall back to text extraction instead of markdown table formatting.
  • Added a regression test that simulates sparse multi-column prose.
  • Added a preservation test to keep dense wide tables converting through the table path.
  • pytest -q packages/markitdown/tests/test_pdf_prose_layout_detection.py -> 2 passed
  • pytest -q packages/markitdown/tests/test_pdf_prose_layout_detection.py packages/markitdown/tests/test_pdf_tables.py packages/markitdown/tests/test_pdf_memory.py packages/markitdown/tests/test_pdf_masterformat.py -> 35 passed, 2 skipped
  • pytest -q packages/markitdown/tests -> 210 passed, 4 skipped
  • packages/markitdown/src/markitdown/converters/_pdf_converter.py
  • packages/markitdown/tests/test_pdf_prose_layout_detection.py
  • CLA automation: The Microsoft policy bot requested CLA confirmation before deeper review; the follow-up reply cleared the license/cla check.
  • 2026-04-29: Microsoft's CLA gate completed successfully after the required acknowledgment reply.
  • The safest fix was not a broad prose detector. A narrow density check reduced regression risk for real forms and tables.
  • A synthetic test page is enough to lock in PDF-layout behavior without needing heavyweight fixtures for every reproduction.
  • On Microsoft repos, CLA automation can be the first gate after opening a PR; clearing it quickly keeps the contribution from stalling before human review starts.
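
The density guard described in the bullets above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name, the constant names, and the representation of each row as a set of populated column indices are all assumptions made for the example. Only the two thresholds (more than 10 columns, median fill ratio below 0.4) come from the write-up.

```python
from statistics import median

# Thresholds taken from the write-up; the names are hypothetical.
MAX_COLUMNS = 10        # "more than 10 columns"
MIN_MEDIAN_FILL = 0.4   # "median table-row fill ratio is below 0.4"


def looks_like_sparse_prose(rows, num_columns):
    """Return True when a wide layout is too sparsely filled to be a table.

    rows: list of sets, each holding the indices of the global columns
          that one candidate table row actually populates (assumed shape).
    num_columns: total number of detected global columns.
    """
    # Narrow layouts and empty inputs are left to the normal table path.
    if num_columns <= MAX_COLUMNS or not rows:
        return False

    # Fraction of the global columns that each row uses.
    fill_ratios = [len(used) / num_columns for used in rows]

    # Reject the table interpretation only when the typical row is sparse.
    return median(fill_ratios) < MIN_MEDIAN_FILL


# Wide multi-column prose: 12 detected columns, each row touches 2 or 3.
sparse_rows = [{0, 5}, {2, 9, 11}, {1, 7}]

# A genuinely dense wide table: nearly every cell is populated.
dense_rows = [set(range(12)), set(range(11)), set(range(12))]

print(looks_like_sparse_prose(sparse_rows, 12))  # fall back to text extraction
print(looks_like_sparse_prose(dense_rows, 12))   # keep the table path
```

Using the median rather than the mean keeps a few fully populated rows, such as a header, from masking an otherwise sparse layout. Requiring both conditions is what lets dense wide tables and ordinary narrow tables continue through the table path.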
