10 changes: 8 additions & 2 deletions packages/markitdown/src/markitdown/converters/_csv_converter.py
@@ -8,8 +8,9 @@
 ACCEPTED_MIME_TYPE_PREFIXES = [
     "text/csv",
     "application/csv",
+    "text/tab-separated-values",
 ]
-ACCEPTED_FILE_EXTENSIONS = [".csv"]
+ACCEPTED_FILE_EXTENSIONS = [".csv", ".tsv"]


 class CsvConverter(DocumentConverter):
@@ -48,7 +49,12 @@ def convert(
             content = str(from_bytes(file_stream.read()).best())

         # Parse CSV content
-        reader = csv.reader(io.StringIO(content))
+        delimiter = "\t" if (stream_info.extension or "").lower() == ".tsv" else ","

The accept() function uses both the MIME type and the file extension to determine a file's type. Consider a file that arrives from a cloud service with no extension but with an accepted MIME type. The delimiter is currently set by checking only the extension, not the MIME type, so for such a file it always defaults to "," and the tab-separated content is parsed with the wrong delimiter, which breaks the conversion.

Example scenario:

  1. A file is received from a cloud service (no extension, but MIME type "text/tab-separated-values").
  2. The accept() function allows this file and returns True.
  3. delimiter = "\t" if (stream_info.extension or "").lower() == ".tsv" else "," does not account for the MIME type and defaults to ",", so the file is parsed with the wrong delimiter.
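
The scenario above can be reproduced with the standard library alone. This is only a sketch: the sample content mirrors test_sample.tsv from this PR, and `extension` is a hypothetical stand-in for `stream_info.extension` on a file that has no extension. In this isolated form `csv.reader` does not raise; each row collapses into a single cell, so the resulting table is wrong. How the rest of the converter reacts is not shown here.

```python
import csv
import io

# Sample content mirroring test_sample.tsv (tab-separated).
content = "Name\tAge\tCity\nRahul\t20\tDelhi\nPriya\t21\tNoida\n"

# Hypothetical stand-in for stream_info.extension when no extension is present.
extension = None

# The extension-only check from the diff: it falls back to "," here.
delimiter = "\t" if (extension or "").lower() == ".tsv" else ","

# Parse once with the chosen (wrong) delimiter and once with a tab.
rows_wrong = list(csv.reader(io.StringIO(content), delimiter=delimiter))
rows_right = list(csv.reader(io.StringIO(content), delimiter="\t"))

print(rows_wrong[0])  # one cell containing the whole header line
print(rows_right[0])  # three separate header cells
```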


This has also been addressed in #2021.

Author


Thanks for the review! Good catch. I'll update the delimiter detection to consider both the file extension and MIME type.
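
One possible shape for that update, as a sketch rather than the change that was actually made: `pick_delimiter` is a hypothetical helper, not part of markitdown, and it takes plain strings so it makes no assumption about the attribute names on `stream_info`. It returns a tab when either the extension or the MIME type indicates TSV.

```python
from typing import Optional

TSV_EXTENSION = ".tsv"
TSV_MIME_PREFIX = "text/tab-separated-values"


def pick_delimiter(extension: Optional[str], mimetype: Optional[str]) -> str:
    """Hypothetical helper: return a tab for TSV input, a comma otherwise."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    # Either signal is enough, mirroring how acceptance considers both.
    if ext == TSV_EXTENSION or mime.startswith(TSV_MIME_PREFIX):
        return "\t"
    return ","


# A file with no extension but a TSV MIME type now gets a tab delimiter.
print(repr(pick_delimiter(None, "text/tab-separated-values")))
```

Matching the MIME type by prefix keeps values that carry parameters, such as a trailing charset, on the TSV path.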


+        reader = csv.reader(
+            io.StringIO(content),
+            delimiter=delimiter
+        )
         rows = list(reader)

         if not rows:
13 changes: 13 additions & 0 deletions packages/markitdown/tests/_test_vectors.py
@@ -152,6 +152,19 @@ class FileTestVector(object):
         ],
         must_not_include=[],
     ),
+    FileTestVector(
+        filename="test_sample.tsv",
+        mimetype="text/tab-separated-values",
+        charset="utf-8",
+        url=None,
+        must_include=[
+            "| Name | Age | City |",
+            "| --- | --- | --- |",
+            "| Rahul | 20 | Delhi |",
+            "| Priya | 21 | Noida |",
+        ],
+        must_not_include=[],
+    ),
     FileTestVector(
         filename="test.json",
         mimetype="application/json",
3 changes: 3 additions & 0 deletions packages/markitdown/tests/test_files/test_sample.tsv
@@ -0,0 +1,3 @@
+Name	Age	City
+Rahul	20	Delhi
+Priya	21	Noida