How PDF to Markdown Works
PDF files store text as positioned glyphs: each character has an absolute x,y coordinate on the page. Unlike HTML or Markdown, there are no semantic tags for headings, paragraphs, or lists.
This tool uses Mozilla PDF.js to extract every text item with its position, font size, and font name. It then reconstructs structure by analyzing:
- Font size: larger text becomes headings (H1-H6)
- Y-position gaps: vertical spacing determines paragraph breaks
- Line prefixes: bullets and numbers become Markdown lists
- Font style: bold and italic map to **bold**/*italic*
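
The four heuristics above can be sketched roughly as follows. This is a minimal illustration, not this tool's actual code: the `Line` shape, the `classify` and `linesToMarkdown` functions, the 12pt body size, and every threshold are assumptions made for the example. It also assumes text items have already been grouped into lines, with `y` measured downward from the top of the page.

```typescript
// Hypothetical, simplified shape of one extracted line of text.
// This is NOT the PDF.js API; the field names are assumptions for this sketch.
interface Line {
  text: string;
  y: number; // vertical position, assumed to grow downward from the page top
  fontSize: number;
  bold?: boolean;
  italic?: boolean;
}

type Kind = "heading" | "list" | "para";

// Classify a single line using font size, line prefix, and font style.
function classify(line: Line, bodySize: number): { kind: Kind; md: string } {
  const text = line.text.trim();

  // Font size: text noticeably larger than the body size becomes a heading.
  // The ratio cut-offs are arbitrary and only cover three levels.
  const ratio = line.fontSize / bodySize;
  if (ratio >= 1.2) {
    const level = ratio >= 1.8 ? 1 : ratio >= 1.4 ? 2 : 3;
    return { kind: "heading", md: `${"#".repeat(level)} ${text}` };
  }

  // Line prefixes: bullets and numbers become Markdown list items.
  const bullet = text.match(/^[•\-*]\s+(.*)$/);
  if (bullet) return { kind: "list", md: `- ${bullet[1]}` };
  const numbered = text.match(/^(\d+)[.)]\s+(.*)$/);
  if (numbered) return { kind: "list", md: `${numbered[1]}. ${numbered[2]}` };

  // Font style: bold and italic map to Markdown emphasis.
  if (line.bold) return { kind: "para", md: `**${text}**` };
  if (line.italic) return { kind: "para", md: `*${text}*` };
  return { kind: "para", md: text };
}

function linesToMarkdown(lines: Line[], bodySize = 12): string {
  const blocks: string[] = [];
  let prev: Line | undefined;
  let prevKind: Kind | undefined;

  for (const line of lines) {
    if (line.text.trim() === "") continue;
    const { kind, md } = classify(line, bodySize);

    // Y-position gap: a jump of more than 1.5x the previous font size
    // is treated as a paragraph break.
    const bigGap = prev !== undefined && line.y - prev.y > prev.fontSize * 1.5;
    const last = blocks.length - 1;

    if (kind === "para" && prevKind === "para" && !bigGap) {
      blocks[last] += ` ${md}`; // wrapped line: continue the paragraph
    } else if (kind === "list" && prevKind === "list" && !bigGap) {
      blocks[last] += `\n${md}`; // next item in the same list
    } else {
      blocks.push(md); // start a new block
    }
    prev = line;
    prevKind = kind;
  }
  return blocks.join("\n\n");
}

const sample: Line[] = [
  { text: "Report", y: 50, fontSize: 24 },
  { text: "First line of a paragraph", y: 90, fontSize: 12 },
  { text: "that wraps onto a second line.", y: 104, fontSize: 12 },
  { text: "• apples", y: 140, fontSize: 12 },
  { text: "• pears", y: 154, fontSize: 12 },
  { text: "Note", y: 190, fontSize: 12, bold: true },
];

console.log(linesToMarkdown(sample));
```

With the sample input, the 24pt line becomes an H1, the two closely spaced 12pt lines merge into one paragraph, the bulleted lines form a list, and the bold line is wrapped in `**`. A real converter would likely estimate the body size from the most common font size in the document rather than hard-coding it, and would need to handle multi-column layouts, which this sketch ignores.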