Bug
`CsvConverter` uses `stream_info.charset` (detected from the first 4096 bytes) to decode CSV files. When the first 4096 bytes are ASCII-only but the rest of the file contains non-ASCII characters (e.g., UTF-8 encoded accented characters), `.decode('ascii')` raises `UnicodeDecodeError`.
Reproducible example
```python
from markitdown import MarkItDown
import io

# CSV with UTF-8 characters beyond the first 4096 bytes
buf = io.BytesIO(('x' * 4096 + ',caf\u00e9\n').encode('utf-8'))
md = MarkItDown()
result = md.convert(buf)
```
Expected
CSV content is decoded successfully, falling back to charset_normalizer when the detected charset fails.
Actual
`UnicodeDecodeError`, because `stream_info.charset` reports `'ascii'` but the file contains UTF-8.
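The failure mode can be isolated with the standard library alone. The sketch below assumes a 4096-byte sniff window, as described in this report; `SNIFF_BYTES` and `first_bad_offset` are illustrative names, not MarkItDown identifiers.

```python
from typing import Optional

SNIFF_BYTES = 4096  # assumed size of the charset-detection window

data = ("x" * 4096 + ",caf\u00e9\n").encode("utf-8")
head = data[:SNIFF_BYTES]

# The sniffed prefix is pure ASCII, so 'ascii' looks like a valid guess.
head_decodes_as_ascii = True
try:
    head.decode("ascii")
except UnicodeDecodeError:
    head_decodes_as_ascii = False

# Decoding the full payload with that guess fails past the window.
first_bad_offset: Optional[int] = None
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    first_bad_offset = exc.start

print(head_decodes_as_ascii, first_bad_offset)
```

The offending byte lies outside the sniffed window, which is why detection based only on the prefix cannot catch it.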
Affected file
`packages/markitdown/src/markitdown/converters/_csv_converter.py`, lines 45-48
Note
This is the same class of bug as #1505 (PlainTextConverter), which was fixed in PR #1938.