You are a machine learning specialist working for a language translation department of a major university.
Your university has developed a mobile/web app that translates across different languages.
You are now in the process of adding some of the more obscure languages in the far north area of the Arctic, such as Inuktun, Nganasan, and Dolgan.
These languages are spoken by very few people in their regions so you have had to build your own data sources of the language patterns for each region. Your machine learning team has decided to use Amazon Kendra to build an indexed searchable document repository.
Your team needs to use the Kendra service to explore its language data so that the data can be cleaned and prepared for use in your language translation software.
Your team has created your Kendra index and has added your data sources (HTML files, plain text files, PDFs, Word documents, PowerPoint presentations) in your S3 bucket to your index using the Kendra BatchPutDocument API call.
However, you see in your CloudWatch logs an HTTP status code of 400 and some of your documents have not been successfully indexed. What could be the source of the indexing failure?
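A. The total size of your files from your S3 bucket exceeds 25 MB
B. The text extracted from an individual Word document exceeds 5 MB
C. PDF documents are not supported by the Kendra BatchPutDocument API call
D. Microsoft PowerPoint presentations are not supported by the Kendra BatchPutDocument API call
Answer: B.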
Option A is incorrect.
The limit on the total size of any single file from your S3 bucket is 50 MB, not 25 MB.
Option B is CORRECT.
One of the limits for Kendra documents is that text extracted from an individual document cannot exceed 5 MB.
Option C is incorrect.
Kendra supports the following unstructured document types: HTML files, Microsoft PowerPoint presentations, Microsoft Word documents, plain text documents, and PDFs.
Option D is incorrect.
Microsoft PowerPoint presentations are one of the unstructured document types that Kendra supports, alongside HTML files, Microsoft Word documents, plain text documents, and PDFs.
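Because every file type in the question is supported, a rough sketch of how a team might label each S3 object before submitting it can help rule out options C and D. The extension table and the helper name kendra_content_type below are assumptions made for illustration. The ContentType strings are the values the Kendra Document structure uses for these five formats, but the mapping from file extensions is not taken from the Kendra documentation.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extension to the ContentType value of a Kendra Document.
# Only the five document types named in the question are covered.
CONTENT_TYPES = {
    ".html": "HTML",
    ".htm": "HTML",
    ".txt": "PLAIN_TEXT",
    ".pdf": "PDF",
    ".docx": "MS_WORD",
    ".pptx": "PPT",
}


def kendra_content_type(s3_key):
    """Return the ContentType for an S3 key, or None if the extension is not in the table."""
    suffix = PurePosixPath(s3_key).suffix.lower()
    return CONTENT_TYPES.get(suffix)
```

A key that maps to None is worth checking by hand before it is sent to the index, since this table is deliberately narrower than what Kendra itself accepts.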
Reference:
Please see the Amazon Kendra developer guide titled Types of Documents.
Please refer to the Amazon Kendra developer guide titled Quotas for Amazon Kendra.
Please review the Amazon Kendra developer guide titled Common Errors.
Please refer to the Amazon Kendra developer guide titled BatchPutDocument.
The HTTP status code of 400 indicates a client-side error, which means that there was an issue with the request sent to the server. In this case, the request was to index documents using the Kendra BatchPutDocument API call.
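When a BatchPutDocument call returns instead of raising an error, its response carries a FailedDocuments list in which each entry has an Id, an ErrorCode, and an ErrorMessage. The following is a minimal sketch of reading that list, assuming the response is the plain dictionary that an SDK such as boto3 returns. The helper name failed_documents and the sample response are hypothetical and are not taken from the scenario.

```python
def failed_documents(response):
    """Return (id, error_code, error_message) for each document reported as failed."""
    return [
        (doc.get("Id"), doc.get("ErrorCode"), doc.get("ErrorMessage"))
        for doc in response.get("FailedDocuments", [])
    ]


# Hypothetical response, shaped like the dictionary returned by batch_put_document.
sample_response = {
    "FailedDocuments": [
        {
            "Id": "dolgan-lexicon.docx",
            "ErrorCode": "InvalidRequest",
            "ErrorMessage": "example message",
        }
    ]
}

for doc_id, code, message in failed_documents(sample_response):
    print(f"{doc_id}: {code} ({message})")
```

Listing the failed document IDs this way narrows the search to specific files, which matters here because only some of the documents failed.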
Each answer option can be checked against Kendra's documented limits and supported document types:
A. The total size of your files from your S3 bucket exceeds 25 MB. This is not the cause. Kendra does limit file size, but the limit is 50 MB per file, not 25 MB, so the threshold in this option is not one that Kendra enforces.
B. The text extracted from an individual Word document exceeds 5 MB. This is the cause. Kendra limits the text extracted from any single document to 5 MB, and a document that exceeds the limit fails to index. Because the limit applies per document, it also explains why only some of the documents failed.
C. PDF documents are not supported by the Kendra BatchPutDocument API call. This statement is false. PDF is one of the unstructured document types that Kendra supports.
D. Microsoft PowerPoint presentations are not supported by the Kendra BatchPutDocument API call. This statement is false. Microsoft PowerPoint is also one of the supported unstructured document types.
To resolve the indexing failure, the team should identify the documents whose extracted text exceeds 5 MB and split them into smaller documents before submitting them again. Converting the documents to another format is not necessary, because every file type the team used (HTML, plain text, PDF, Word, and PowerPoint) is supported by Kendra.
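The size checks and the splitting step can be sketched as follows. This is an illustration under stated assumptions, not Kendra's own logic: it treats 1 MB as 1024 * 1024 bytes, measures extracted text as UTF-8 bytes, and splits on blank lines between paragraphs. The names check_limits and split_text are hypothetical.

```python
# Limits cited in the answer explanation; the byte interpretation is an assumption.
MAX_FILE_BYTES = 50 * 1024 * 1024   # per-file size limit
MAX_TEXT_BYTES = 5 * 1024 * 1024    # per-document extracted-text limit


def check_limits(file_bytes, extracted_text):
    """Return the names of the limits that a document would exceed."""
    problems = []
    if file_bytes > MAX_FILE_BYTES:
        problems.append("file size")
    if len(extracted_text.encode("utf-8")) > MAX_TEXT_BYTES:
        problems.append("extracted text")
    return problems


def split_text(text, max_bytes=MAX_TEXT_BYTES):
    """Split text on paragraph boundaries into chunks of at most max_bytes.

    A single paragraph larger than max_bytes is kept whole, so it would
    still need manual splitting.
    """
    chunks, current, current_size = [], [], 0
    for paragraph in text.split("\n\n"):
        size = len(paragraph.encode("utf-8")) + 2  # +2 for the "\n\n" separator
        if current and current_size + size > max_bytes:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk returned by split_text could then be submitted as its own document with a distinct ID, so that no single document carries more than 5 MB of text.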