Supported File Formats
Alan supports a variety of file formats for creating knowledge databases. Here is an overview of the supported formats and their suitability.
File Format | Suitability |
---|---|
.txt | Very well suited |
.md | Very well suited |
.jsonl | Very well suited* |
.html | Well suited |
.docx | Well suited |
.csv | Well suited |
.xlsx | Medium |
.pptx | Medium |
.pdf | Less suitable |
* .jsonl files allow advanced possibilities. See below.
Our support team (support@alan.de) is delighted to offer consulting services to assist you in optimizing your data for Alan. This service is billed on a time and materials basis (T&M), and we're happy to provide a quote if a framework agreement is not already in place.
Explanation of Suitability
- Very well suited: These formats are ideal for creating knowledge databases as they can be easily processed and offer high text quality.
- Well suited: These formats are well suited, but there may be slight limitations in text processing.
- Medium: These formats are moderately suitable, but there may be limitations in text processing.
- Less suitable: These formats are less suitable as they may lead to imprecise text extractions or processing errors.
PDF Format
The PDF format is one of the most widely used formats for documents and is supported by Alan. However, it has a few shortcomings that can affect its suitability for text extraction and analysis.
To process PDF files, Alan accesses the text layer, which may not always be available or accurate. As a result, context information can be lost, and errors in extraction may occur.
Whenever possible, it is recommended to upload the source document of the PDF file, such as a .docx file, instead of the PDF itself. This ensures that the text is extracted with higher accuracy and preserves the original context and formatting.
XLSX Format
The XLSX format is widely used for storing tabular data, but it has some limitations when it comes to text extraction. Alan can process XLSX files, but only the text content of the cells is extracted; all other information is lost.
Specifically, the following information is not preserved:
- Geometric information, such as cell placement and layout
- Cell joining and merging information
- Formulas and calculations
If you need to upload tabular data, we recommend using CSV files instead. CSV files are better suited for text-based analysis and can help you get the most out of Alan's capabilities. If you need assistance preprocessing your data for Alan, please reach out to our support team as described above.
JSONL Format
.jsonl files provide you with more control over the extraction of information from your documents. These .jsonl files are typically generated from source documents to prepare their information in a structured format for processing by Alan.
The JSONL format (JSON Lines) is a text format that allows data to be stored in a simple and easily readable form. Each line in a JSONL file contains a single JSON object, and lines are separated by a newline (\n). Because every line can be parsed on its own, even large amounts of data can be processed efficiently.
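The line-by-line structure can be illustrated with a short Python sketch that parses a JSONL stream one line at a time, so the whole file never has to be held in memory. The function name iter_jsonl is a hypothetical helper for this illustration and is not part of Alan.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield obj


# In practice the stream would be an open file; a string buffer keeps the sketch self-contained.
sample = io.StringIO(
    '{"title": "Title 1", "content": "This is an example text."}\n'
    '{"title": "Title 2", "content": "This is another example text."}\n'
)
records = list(iter_jsonl(sample))
```

Running such a check before an upload points to the exact line that is not valid JSON.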
When uploading a .jsonl file to Alan, the individual lines are indexed separately. The text in a line is embedded as a whole into the vector database; there will be no further chunking of the texts. This gives you control over the size of the chunks and allows you to structure texts as best suited for your application. Further details on the indexing pipeline can be found here.
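Since every line becomes exactly one chunk, the chunking has to happen before the upload. The following is a minimal sketch, assuming the source text is Markdown and that one chunk per heading is a sensible granularity. The function split_by_headings and the sample document are hypothetical; other splitting strategies may suit your data better.

```python
import json
from typing import Optional


def split_by_headings(markdown_text: str, document_title: str,
                      source: Optional[str] = None) -> list:
    """Split a Markdown text into one record per heading section."""
    records = []
    heading = None
    body = []

    def flush():
        # Emit the section collected so far; sections without text are skipped.
        content = "\n".join(body).strip()
        if content:
            title = f"{document_title}: {heading}" if heading else document_title
            record = {"title": title, "content": content}
            if source:
                record["source"] = source
            records.append(record)

    for line in markdown_text.splitlines():
        # Simplification: any line starting with '#' counts as a heading,
        # including '#' lines inside code blocks.
        if line.startswith("#"):
            flush()
            heading = line.lstrip("#").strip()
            body = []
        else:
            body.append(line)
    flush()
    return records


doc = "Intro text.\n# Setup\nInstall it.\n## Usage\nRun it.\n# Empty\n"
chunks = split_by_headings(doc, "Manual", "https://www.example.com/manual")
# One JSON object per line; json.dumps keeps each record on a single line.
jsonl_text = "".join(json.dumps(c, ensure_ascii=False) + "\n" for c in chunks)
```

Each resulting line carries its heading in the title, so the source references stay readable.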
Alan expects each JSON object (each line) to contain the following fields:
- title: The display name in the source references
- content: The text content
- source: (Optional) an absolute http(s) link to the source, which will be displayed in the source references.
An example of a JSONL document:
{"title": "Title 1", "content": "This is an example text.", "source": "https://www.example.com"}
{"title": "Title 2", "content": "This is another example text.", "source": "https://www.example.com/example2"}
{"title": "Title 3", "content": "This is a third example text."}
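A file in this shape can be produced with a few lines of Python. The sketch below checks the fields described above before serializing. The helper names validate_record and to_jsonl are hypothetical, and the non-empty check on title and content is an assumption of this sketch rather than a documented rule of Alan.

```python
import json
from urllib.parse import urlparse


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    for field in ("title", "content"):
        value = record.get(field)
        # Assumption: both fields should be non-empty strings.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{field}' must be a non-empty string")
    source = record.get("source")
    if source is not None:  # 'source' is optional
        parsed = urlparse(source) if isinstance(source, str) else None
        if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("'source' must be an absolute http(s) link")
    return problems


def to_jsonl(records: list) -> str:
    """Serialize records to JSONL, one JSON object per line."""
    lines = []
    for index, record in enumerate(records, start=1):
        problems = validate_record(record)
        if problems:
            raise ValueError(f"record {index}: " + "; ".join(problems))
        # json.dumps escapes newlines inside strings, so a record never spans two lines.
        lines.append(json.dumps(record, ensure_ascii=False) + "\n")
    return "".join(lines)


jsonl_text = to_jsonl([
    {"title": "Title 1", "content": "This is an example text.", "source": "https://www.example.com"},
    {"title": "Title 3", "content": "This is a third\nexample text."},
])
```

Using a JSON library instead of string concatenation avoids broken lines caused by quotes or line breaks in the text.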
Use Cases
- Text Segmentation: You can structure the texts in the JSONL files to be divided into meaningful sections. This allows for more accurate analysis and better results.
- Import of Structured Data: You can store structured data in JSONL files and import them into Alan. These can come from ticket systems (one ticket per .jsonl line), databases, or other sources.
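As an illustration of the ticket use case, the following sketch converts a CSV export into one .jsonl line per ticket. The column names id, subject, description, and url are assumptions about a hypothetical export, and the function tickets_to_jsonl is not part of Alan. Adjust both to your ticket system.

```python
import csv
import io
import json


def tickets_to_jsonl(csv_text: str) -> str:
    """Convert a ticket CSV export into JSONL, one ticket per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {
            "title": f"Ticket {row['id']}: {row['subject']}",
            "content": row["description"],
        }
        if row.get("url"):  # 'source' is optional, so empty links are left out
            record["source"] = row["url"]
        lines.append(json.dumps(record, ensure_ascii=False) + "\n")
    return "".join(lines)


# Hypothetical export with a multi-line description and a ticket without a link.
export = (
    "id,subject,description,url\n"
    '17,Login fails,"User cannot log in.\nReset did not help.",https://www.example.com/tickets/17\n'
    "18,Slow export,Export takes minutes.,\n"
)
jsonl_text = tickets_to_jsonl(export)
```

Because each ticket ends up on its own line, it is indexed as a single chunk and appears under its own title in the source references.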