CsvConverter throws UnicodeDecodeError when charset detection from partial content returns 'ascii' #1949

@hanhan761

Bug

`CsvConverter` uses `stream_info.charset` (detected from the first 4096 bytes) to decode CSV files. When the first 4096 bytes are ASCII-only but the full file contains non-ASCII characters (e.g., UTF-8 encoded accented characters), `.decode('ascii')` raises `UnicodeDecodeError`.

Reproducible example

```python
from markitdown import MarkItDown
import io

# CSV with UTF-8 characters beyond the first 4096 bytes
buf = io.BytesIO(('x' * 4096 + ',caf\u00e9\n').encode('utf-8'))
md = MarkItDown()
result = md.convert(buf)
```

Expected

CSV content is decoded successfully, falling back to charset_normalizer when the detected charset fails.

Actual

`UnicodeDecodeError`, because `stream_info.charset` reports `'ascii'` but the file contains UTF-8.
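
The root cause can be shown with the standard library alone, without markitdown. The sketch below uses a hypothetical helper, `first_undecodable_offset`, and assumes (per the description above) that detection only sees the first 4096 bytes. It shows that the prefix decodes as ASCII while the full payload does not.

```python
from typing import Optional

SAMPLE_SIZE = 4096  # size of the prefix used for charset detection, per the report


def first_undecodable_offset(data: bytes, charset: str) -> Optional[int]:
    """Hypothetical helper: byte offset of the first decode failure, or None."""
    try:
        data.decode(charset)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


payload = ("x" * 4096 + ",caf\u00e9\n").encode("utf-8")

# The sampled prefix is pure ASCII, so 'ascii' looks like a valid guess ...
prefix_offset = first_undecodable_offset(payload[:SAMPLE_SIZE], "ascii")
# ... but the full payload contains the two-byte UTF-8 sequence for U+00E9.
full_offset = first_undecodable_offset(payload, "ascii")

print(prefix_offset, full_offset)
```

The failing byte sits past the sampled window, which is why a charset inferred from partial content cannot be treated as authoritative for the whole stream.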

Affected file

`packages/markitdown/src/markitdown/converters/_csv_converter.py`, lines 45-48

Note

This is the same class of bug as #1505 (PlainTextConverter), which was fixed in PR #1938.
