Analytics & Data

Athena, Glue, EMR, Redshift, Kinesis, and data analytics patterns.

Question 1. A company stores application logs in S3 in JSON format. They need to run ad-hoc SQL queries on this data without setting up any servers or loading data into a database. Which service should they use?
Question 2. A company runs frequent Athena queries on a large S3 dataset (10 TB). Their Athena costs are high because each query scans the entire dataset. Which combination of optimizations would reduce costs the MOST? (Select the single best answer.)
Question 3. A company has data in multiple formats across S3, RDS, and Redshift. They need to create a unified metadata catalog so that Athena, EMR, and Redshift Spectrum can all discover and query the data. Which service provides this centralized metadata catalog?
Question 4. A company receives raw CSV files in S3 daily. They need to automatically transform these files into Parquet format, apply data quality checks, and load the results into a data lake. The process should be serverless and triggered by file arrival. Which service should they use?
Question 5. A data science team needs to run Apache Spark jobs on a 500 GB dataset for machine learning model training. The jobs run for 2-4 hours and are submitted several times per week. They need full control over the Spark configuration. Which service should they use?
Question 6. A company runs daily EMR Spark jobs that process 2 TB of data. The jobs take 3 hours on a cluster of 20 m5.xlarge instances. They want to reduce costs while maintaining job completion time. Which approach provides the BEST cost optimization?
Question 7. A company needs a fully managed data warehouse to run complex analytical queries on structured data from multiple sources, with results returned in seconds. Which service should they use?
Question 8. A company uses Amazon Redshift for their data warehouse. They also have a large amount of historical data in S3 that is queried infrequently. They want to query both Redshift tables and S3 data using a single SQL query without loading the S3 data into Redshift. Which feature enables this?
Question 9. A company needs to collect and process real-time clickstream data from their website, with the ability to process records within seconds of arrival. Which service is designed for real-time data streaming?
Question 10. A company needs to load streaming data from Kinesis Data Streams into S3 in near real-time, automatically batching records and converting them from JSON to Parquet format. Which service provides this with the LEAST operational overhead?
Question 11. A company uses Kinesis Data Streams with 10 shards. Their consumer application processes records but falls behind during peak traffic. They need to increase throughput without modifying the consumer application. What should they do?
Question 12. A business team needs to create interactive dashboards and visualizations from data stored in Redshift, Athena, and S3 without managing any BI server infrastructure. Which AWS service should they use?
Question 13. A company is building a data lake on S3 with data from multiple sources. They need fine-grained access control (column-level and row-level security) for different teams querying the data through Athena and Redshift Spectrum. Which service provides centralized data lake governance?
Question 14. A company's Redshift cluster experiences slow query performance during peak hours when many users run concurrent queries. Adding more nodes is not cost-effective for occasional peaks. Which Redshift feature automatically adds transient capacity to handle query bursts?
Question 15. A company has three different consumer applications that need to read from the same Kinesis Data Stream independently and in real-time. Using the standard shared throughput model, consumers compete for the 2 MB/s per-shard read limit. How can they ensure each consumer gets dedicated throughput?
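
To make the constraint in the last question concrete, the sketch below works out how much read throughput each consumer would get per shard if the 2 MB/s shared limit were split evenly. The function name and the even-split assumption are illustrative only; in practice consumers are not guaranteed an equal share.

```python
# Per-shard read limit under the standard shared throughput model,
# as stated in the question (2 MB/s shared by all consumers of a shard).
SHARED_READ_LIMIT_MB_PER_S = 2.0


def shared_read_share(num_consumers: int,
                      limit_mb_per_s: float = SHARED_READ_LIMIT_MB_PER_S) -> float:
    """Return the per-consumer, per-shard read throughput in MB/s.

    Assumes (hypothetically) that the shared limit is divided evenly
    among all consumers reading from the same shard.
    """
    if num_consumers < 1:
        raise ValueError("num_consumers must be at least 1")
    return limit_mb_per_s / num_consumers


if __name__ == "__main__":
    # Three consumers competing for one shard's shared read limit.
    print(f"{shared_read_share(3):.2f} MB/s per consumer per shard")  # 0.67
```

With three consumers, each one sees roughly a third of the shard's read capacity, which is why the scenario asks for a way to give every consumer its own dedicated throughput.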