TOP 15 ETL Testing Interview Questions and Answers 2019 Part-1 | ETL Testing | Wisdom Jobs
26. What is a session in ETL?
A session is a set of instructions that describe the data movement from the source to the destination.
In data warehousing architecture, ETL is an important component that manages the data for any business process. ETL stands for Extract, Transform, and Load. Extract reads data from a database, Transform converts the data into a format appropriate for reporting and analysis, and Load writes the data into the target database.
Data mining can be defined as the process of extracting hidden predictive information from large databases and interpreting the data, while a data warehouse may make use of data mining for faster analytical processing. Data warehousing is the process of aggregating data from multiple sources into one common repository.
12. What are the characteristics of snapshots?
Snapshots are replicas of tables. They are located on remote nodes and refreshed periodically so that changes to the master table are recorded.
27. What is meant by Worklet in ETL?
A worklet is a reusable group of tasks in an ETL workflow; any set of tasks that should run together can be grouped into a worklet.
33. When are the tables in ETL analyzed?
The ANALYZE statement allows the validation and computation of statistics for an index, table, or cluster.
18. Which partition is used to improve the performances of ETL transactions?
To improve the performances of ETL transactions, the session partition is used.
34. How are the tables analyzed in ETL?
Statistics generated by the ANALYZE statement are used by a cost-based optimizer to calculate the most efficient plan for data retrieval. The ANALYZE statement can also support the validation of object structures and space management in the system. Its operations include COMPUTE, ESTIMATE, and DELETE.
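Syntax differs by engine: Oracle's ANALYZE supports COMPUTE, ESTIMATE, and DELETE, while SQLite's ANALYZE simply gathers optimizer statistics into the sqlite_stat1 table. A runnable sketch of the idea, using Python's built-in sqlite3 module with an invented table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
cur.execute("CREATE INDEX idx_amount ON sales (amount)")
cur.executemany("INSERT INTO sales (amount) VALUES (?)",
                [(i * 1.5,) for i in range(1000)])
conn.commit()

# Gather optimizer statistics for all tables and indexes in the database.
cur.execute("ANALYZE")

# The collected statistics land in sqlite_stat1, one row per index.
for row in cur.execute("SELECT tbl, idx, stat FROM sqlite_stat1"):
    print(row)
```

The cost-based optimizer then consults these statistics when choosing an access path, which is exactly why stale statistics can lead to poor query plans.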
Top Answers to ETL Interview Questions
ETL stands for extract, transform, and load. These are three database functions combined into a single tool so that you can take data out of one database and load it into another. This ETL Interview Questions blog has a compiled list of questions that are most frequently asked during interviews. Prepare the ETL interview questions listed below and get ready to crack your job interview.
2. What is an ETL process?
ETL is the process of Extraction, Transformation, and Loading.
1. Compare between ETL and ELT.
| Criteria | ETL | ELT |
| --- | --- | --- |
| Flexibility | High | Low |
| Working methodology | Data is transformed on the way from the source system to the data warehouse | Leverages the target system to transform data |
| Performance | Average | Good |
Toptal sourced essential questions that the best ETL developers and engineers can answer. Drawn from our community, we encourage experts to submit questions and offer feedback.
What are the most common transformations in ETL processes?
Although the list might get very long, there are some basic steps performed during the ETL process that every ETL developer should mention immediately: data conversion, aggregation, deduplication, and filtering.
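A minimal, self-contained sketch of those basic steps in pure Python (the records and field names are invented for illustration):

```python
# Toy records extracted from a source (field names are illustrative).
raw = [
    {"id": 1, "amount": "10.5", "region": "EU"},
    {"id": 1, "amount": "10.5", "region": "EU"},   # duplicate
    {"id": 2, "amount": "7.0",  "region": "US"},
    {"id": 3, "amount": "-1.0", "region": "EU"},   # invalid, to be filtered
]

# Deduplication: keep the first record per id.
seen, deduped = set(), []
for rec in raw:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        deduped.append(rec)

# Data conversion + filtering: cast amounts to float, drop negatives.
clean = [{**r, "amount": float(r["amount"])} for r in deduped]
clean = [r for r in clean if r["amount"] >= 0]

# Aggregation: total amount per region.
totals = {}
for r in clean:
    totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
print(totals)  # {'EU': 10.5, 'US': 7.0}
```

In a real pipeline each of these steps would typically be a transformation component in the ETL tool or a SQL statement, but the logic is the same.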
Other options a candidate might mention include sorting, lookups, pivoting, and splitting or merging columns.
What is a “staging” area, and why is it needed?
Staging is an optional, intermediate storage area in ETL processes.
The decision “to stage or not to stage” can be split into four main considerations:
How would you prepare and develop incremental loads?
The most common way to prepare for incremental load is to use information about the date and time a record was added or modified. It can be designed during the initial load and maintained later, or added later in an ETL process based on business logic.
It is very important to make sure that the fields used for this are not modified during the process and that these can be trusted.
The next step is to decide how to capture the changes, but the underlying basics are always the same: compare the last modified date to the maximum date already present in the target, and then take all records with a later date.
Another option is to prepare a process for delta loads, which would compare already existing records with new ones, and only load the differences. But this is not the most efficient way.
What is the advantage of third-party tools like SSIS compared to SQL scripts?
Third-party tools offer faster and simpler development. Thanks to their GUIs, these tools can also be used by people who are not technical experts but have wide knowledge about the business itself.
ETL tools are able to generate metadata automatically and have predefined connectors for most sources. One of the most important features is also the ability to join data from multiple files on the fly.
How would you update a big table, i.e. one having over 10 million rows?
The most common way is to use batches: split one big query into smaller ones, e.g. 10,000 rows per batch.
If the table is too big, a better option might be to create a new table, insert the changed data, and then switch tables.
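A sketch of the batching loop using Python and SQLite (the batch size and status column are illustrative; in Transact-SQL the same loop is often written with UPDATE TOP (n)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, status TEXT)")
cur.executemany("INSERT INTO big_table (status) VALUES (?)", [("old",)] * 25)
conn.commit()

BATCH = 10  # in practice this might be 10,000 rows
while True:
    # Update one batch of not-yet-processed rows, then commit so each
    # transaction stays short and locks are released quickly.
    cur.execute(
        "UPDATE big_table SET status = 'new' WHERE id IN "
        "(SELECT id FROM big_table WHERE status = 'old' LIMIT ?)",
        (BATCH,),
    )
    conn.commit()
    if cur.rowcount == 0:  # nothing left to update
        break
```

Committing per batch keeps the transaction log small and avoids holding long locks on the whole table, which is the main point of batching.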
What are the disadvantages of indexes?
Indexes allow fast lookups, but they decrease load performance: Heavily indexed tables will not allow effective DML operations, i.e. insertions and updates.
It’s worth noting that indexes take additional disk space. Worse, though, the database back end needs to update all relevant indexes whenever data changes. It also creates additional overhead due to index fragmentation: Developers or DBAs have to take care of index maintenance, reorganization, and rebuilds.
Index fragmentation causes serious performance issues. When new data is inserted into an index, the database engine has to find space for it. It might happen that the new data insert messes up the current order—the SQL engine might split the data from a single data page, which creates an excessive amount of free space (internal fragmentation). It might also mess up the current page order, which forces the SQL engine to jump between pages when reading data from the disk. All of this creates additional overhead to the process of reading the data and forces random disk I/O.
Microsoft, in particular, recommends reorganizing an index when index fragmentation is between 5 and 30 percent, and rebuilding it when fragmentation is greater than 30 percent. An index rebuild in SQL Server creates another index underneath and then replaces the previous one. Rebuilding may block reads on the whole table (when using an edition other than Enterprise). Index reorganization is basically a reordering of leaf pages and an attempt to compact data pages.
What’s better from a performance point of view: Filtering data first and then joining it with other sources, or joining it first and then filtering?
It’s better to filter data first and then join it with other sources.
A good way to improve ETL process performance is to get rid of unwanted data as soon as possible in the process. It reduces the time spent on data transfer and/or I/O and memory processing.
The general rule is to reduce the number of processed rows and to avoid transforming data that never gets to the target.
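To make the intent concrete, a derived table can apply the filter before the join (SQLite via Python; many optimizers push such predicates down automatically, but writing it this way makes the filter-first rule explicit). The tables and values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders (id INTEGER, customer_id INTEGER, year INTEGER);
CREATE TABLE customers (id INTEGER, name TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (1, 1, 2023), (2, 1, 2024), (3, 2, 2024);
""")

# Filter first: the derived table keeps only 2024 orders, so the join
# processes two rows instead of three.
rows = cur.execute("""
SELECT c.name, o.id
FROM (SELECT id, customer_id FROM orders WHERE year = 2024) o
JOIN customers c ON c.id = o.customer_id
""").fetchall()
print(sorted(rows))  # [('Ann', 2), ('Bob', 3)]
```

With millions of rows, discarding non-matching records before the join avoids transferring and hashing data that would be thrown away anyway.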
How would you prepare logging for an ETL process?
Logging is extremely important to keep track of all changes and failures during a load. The most common ways to prepare for logging are to use flat files or a logging table. That is, during the process, counts, timestamps, and metadata about the source and target are added and then dumped into a flat file or table.
This way the load can be checked for invalid runs. When such a table or file exists, the next step would be to prepare notifications. This could be a report, or a simple formatted email, describing the load as soon as it finishes (e.g. the number of processed records compared to the previous load.)
To achieve that in an ETL process, a developer would add event handlers (SSIS) or use variables (like the system variable @@ROWCOUNT in Transact-SQL) to keep track of inserted, updated, and deleted records. When using SSIS, we can also track every processed package using the SSISDB database.
Other than flat files and the database itself, ETL tools offer native notification and logging features: for example, a dashboard indicating the current load status.
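A minimal sketch of a logging table (Python with SQLite; cursor.rowcount plays the role of @@ROWCOUNT here, and the table layout is illustrative):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE etl_log (
    run_at TEXT, source TEXT, target TEXT, rows_loaded INTEGER, status TEXT
);
CREATE TABLE target (id INTEGER);
""")

# Pretend load step: insert some rows into the target.
cur.executemany("INSERT INTO target VALUES (?)", [(i,) for i in range(5)])
rows_loaded = cur.rowcount  # analogous to @@ROWCOUNT in Transact-SQL

# Record the timestamp, source/target metadata, and counts for this run.
cur.execute(
    "INSERT INTO etl_log VALUES (?, ?, ?, ?, ?)",
    (datetime.now(timezone.utc).isoformat(),
     "source_table", "target", rows_loaded, "OK"),
)
conn.commit()
```

A nightly report or notification email can then simply query etl_log and compare each run's counts against the previous load.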
What is the purpose of data profiling in an ETL process? Which steps in the data profiling process are the most important?
Data profiling tasks help maintain data quality. During this phase, a number of issues are checked for and resolved. The most important are:
What are three different approaches to implementing row versioning?
Maintaining row history requires implementing a versioning policy. The three most popular types are:
- Overwrite: the new value simply replaces the old one, so no history is kept.
- Add a new row: every change inserts a new row, with validity periods tracked using datetimes (or a tsrange, perhaps).
- Add a new column: the previous value is moved to a dedicated column (e.g. old_amount) and the new value takes the place of the original (e.g. amount).
.)There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science — and a lot of work.