Top 35 ETL Interview Questions & Answers 2022

26. What is a session in ETL?

A session is a set of instructions that describe the data movement from the source to the destination.

In data warehousing architecture, ETL is an important component that manages the data for a business process. ETL stands for Extract, Transform, and Load. Extract is the process of reading data from a source database. Transform is the process of converting that data into a format appropriate for reporting and analysis. Load is the process of writing the data into the target database.
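
The three steps can be sketched as three small functions. This is a minimal illustration, assuming a CSV string as the source and an in-memory SQLite database as the target; the names `SOURCE_CSV`, `extract`, `transform`, `load`, and the `sales` table are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read from a source database or file.
SOURCE_CSV = "id,amount\n1,10.50\n2,3.25\n"

def extract(text):
    """Extract: read raw rows (all values are strings) from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert the strings into types suited to reporting."""
    return [(int(r["id"]), float(r["amount"])) for r in rows]

def load(conn, rows):
    """Load: write the converted rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```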

Data mining can be defined as the process of extracting hidden predictive information from large databases and interpreting that data. Data warehousing is the process of aggregating data from multiple sources into one common repository, and it may make use of data mining to analyze that data faster.

12. What are the characteristics of snapshots?

Snapshots are replicas of tables. They are located on remote nodes and refreshed periodically so that changes made to the master table are picked up.

27. What is meant by Worklet in ETL?

A worklet is a set of tasks grouped together in an ETL tool. It can contain any set of tasks in the program.

33. When are the tables in ETL analyzed?

The ANALYZE statement validates the structure of, and computes statistics for, an index, a table, or a cluster.

18. Which partition is used to improve the performance of ETL transactions?

To improve the performance of ETL transactions, the session partition is used.

34. How are the tables analyzed in ETL?

Statistics generated by the ANALYZE statement are used by a cost-based optimizer to calculate the most efficient plan for data retrieval. The ANALYZE statement also supports validating the structure of objects and managing space in the system. Its operations include COMPUTE, ESTIMATE, and DELETE.

Top Answers to ETL Interview Questions


ETL stands for extract, transform, and load. These three database functions are combined into a single tool so that you can take data out of one database and store it in another. This ETL Interview Questions blog compiles the questions most commonly asked during interviews. Prepare the ETL interview questions listed below and get ready to crack your job interview:


2. What is an ETL process?

ETL is the process of Extraction, Transformation, and Loading.

1. Compare between ETL and ELT.

Criteria | ETL | ELT
Flexibility | High | Low
Working methodology | Data is transformed on its way from the source system to the data warehouse | Leverages the target system to transform data
Performance | Average | Good
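
The difference in working methodology can be sketched as follows. This is an illustration only, assuming two in-memory SQLite databases as targets; the tables `customers` and `raw_customers` and the sample rows are hypothetical. In the ETL variant the pipeline code cleans the rows before loading; in the ELT variant the raw rows are loaded first and the target engine does the cleaning in SQL.

```python
import sqlite3

raw = [("  alice ", "10"), ("BOB", "20")]  # hypothetical source rows

# ETL: transform in the pipeline, then load the finished rows.
etl = sqlite3.connect(":memory:")
etl.execute("CREATE TABLE customers (name TEXT, score INTEGER)")
cleaned = [(name.strip().lower(), int(score)) for name, score in raw]
etl.executemany("INSERT INTO customers VALUES (?, ?)", cleaned)

# ELT: load the raw rows as-is, then let the target engine transform them.
elt = sqlite3.connect(":memory:")
elt.execute("CREATE TABLE raw_customers (name TEXT, score TEXT)")
elt.executemany("INSERT INTO raw_customers VALUES (?, ?)", raw)
elt.execute(
    "CREATE TABLE customers AS "
    "SELECT lower(trim(name)) AS name, CAST(score AS INTEGER) AS score "
    "FROM raw_customers"
)

query = "SELECT name, score FROM customers ORDER BY name"
etl_rows = etl.execute(query).fetchall()
elt_rows = elt.execute(query).fetchall()
```

Both variants end with the same `customers` content; what differs is where the transformation work runs.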

    What are the most common transformations in ETL processes?

    Although the list might get very long, there are some basic steps performed during the ETL process that every ETL developer should mention immediately: data conversion, aggregation, deduplication, and filtering.

    Other options a candidate might mention include:

  • Data cleaning
  • Formatting
  • Merging/joining
  • Calculating new fields
  • Sorting
  • Pivoting
  • Lookup operations
  • Data validation
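
    The four basic steps named above (conversion, deduplication, filtering, aggregation) can be sketched in plain Python. The record layout and the rule that negative amounts are invalid are assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical raw records: (order_id, region, amount), all as text.
raw = [
    ("1", "north", "10.0"),
    ("1", "north", "10.0"),   # exact duplicate of the first record
    ("2", "south", "5.5"),
    ("3", "north", "-1.0"),   # assumed invalid: negative amount
    ("4", "south", "4.5"),
]

# Data conversion: cast text fields to numeric types.
converted = [(int(oid), region, float(amount)) for oid, region, amount in raw]

# Deduplication: keep the first record seen for each order_id.
seen = set()
deduped = []
for row in converted:
    if row[0] not in seen:
        seen.add(row[0])
        deduped.append(row)

# Filtering: drop rows that fail the (assumed) business rule.
valid = [row for row in deduped if row[2] >= 0]

# Aggregation: total amount per region.
totals = defaultdict(float)
for _, region, amount in valid:
    totals[region] += amount
```
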
    What is a “staging” area, why is it needed?

    Staging is an optional, intermediate storage area in ETL processes.

    The decision “to stage or not to stage” can be split into four main considerations:

  • Auditing purposes. Thanks to the staging area we are able to compare the original input file with our outcome. It’s extremely useful, especially when the source system overwrites the history (e.g., flat files on an FTP server are being overwritten every day.)
  • Recovery needs. Even though PCs are getting faster—typically having more bandwidth of nearly every form—there are still some legacy systems and environments that are not performant to extract data from. It’s good practice to store the data as soon as it’s extracted from the source system. This way staging objects can act as recovery checkpoints, avoiding the situation where a process needs to be completely rerun when it fails at 90 percent done.
  • Backup. In the case of a failure, the staging area can be used to recover data in a target system.
  • Load performance. If the data has to be loaded as soon as possible into the system, staging is the way to go. Developers load data as-is into the staging area, then perform various transformations on it from there. It’s far more efficient than transforming the data on-the-fly before loading it into the target system, but the tradeoff here is higher disk space usage.
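
    A minimal sketch of the staging pattern, assuming an in-memory SQLite database; the tables `stg_orders` and `orders` and both function names are hypothetical. Rows land in staging unchanged, and the target is rebuilt from staging, so a failed transformation can be rerun without going back to the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Staging keeps the data exactly as extracted (text), plus a load timestamp.
conn.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, loaded_at TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

def stage(conn, rows):
    """Store extracted rows unchanged so they can be audited or reprocessed."""
    conn.executemany(
        "INSERT INTO stg_orders VALUES (?, ?, datetime('now'))", rows
    )
    conn.commit()

def transform_from_staging(conn):
    """Rebuild the target from staging; rerunning it gives the same result."""
    conn.execute("DELETE FROM orders")
    conn.execute(
        "INSERT INTO orders "
        "SELECT CAST(order_id AS INTEGER), CAST(amount AS REAL) FROM stg_orders"
    )
    conn.commit()

stage(conn, [("1", "9.5"), ("2", "20.0")])
transform_from_staging(conn)
transform_from_staging(conn)  # rerun after a (simulated) failure, no re-extract

staged_count = conn.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
target_rows = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
```
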
    How would you prepare and develop incremental loads?

    The most common way to prepare for incremental load is to use information about the date and time a record was added or modified. It can be designed during the initial load and maintained later, or added later in an ETL process based on business logic.

    It is very important to make sure that the fields used for this are not modified during the process and that these can be trusted.

    The next step is to decide how to capture the changes, but the underlying basics are always the same: compare the last-modified date of each source record with the maximum date already present in the target, and take all records with a later date.

    Another option is to prepare a process for delta loads, which compares already existing records with new ones and loads only the differences. This is usually less efficient, because every record has to be compared.
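
    The date-based approach can be sketched as below, assuming two in-memory SQLite databases, a hypothetical `items` table, and ISO 8601 timestamps stored as text (which sort correctly as strings). The function name `incremental_load` is also an assumption.

```python
import sqlite3

src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
ddl = "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)"
src.execute(ddl)
tgt.execute(ddl)

def incremental_load(src, tgt):
    """Copy only source rows modified after the newest row in the target."""
    # Watermark: the maximum modified_at already present in the target.
    watermark = tgt.execute(
        "SELECT COALESCE(MAX(modified_at), '') FROM items"
    ).fetchone()[0]
    changed = src.execute(
        "SELECT id, name, modified_at FROM items WHERE modified_at > ?",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE handles both new rows and updated existing rows.
    tgt.executemany(
        "INSERT OR REPLACE INTO items (id, name, modified_at) VALUES (?, ?, ?)",
        changed,
    )
    tgt.commit()
    return len(changed)

src.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01T00:00:00"), (2, "b", "2024-01-02T00:00:00")],
)
src.commit()
first = incremental_load(src, tgt)    # initial load: both rows

src.execute("UPDATE items SET name = 'a2', modified_at = '2024-01-03T00:00:00' WHERE id = 1")
src.execute("INSERT INTO items VALUES (3, 'c', '2024-01-04T00:00:00')")
src.commit()
second = incremental_load(src, tgt)   # one update plus one new row
third = incremental_load(src, tgt)    # nothing changed since the last run

target_rows = tgt.execute("SELECT id, name FROM items ORDER BY id").fetchall()
```

    The strict `>` comparison skips any row that arrives later with exactly the watermark timestamp, which is one reason the date fields must be trustworthy.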

    What is the advantage of third-party tools like SSIS compared to SQL scripts?

    Third-party tools offer faster and simpler development. Thanks to their GUIs, these tools can also be used by people who are not technical experts but have wide knowledge about the business itself.

    ETL tools are able to generate metadata automatically and have predefined connectors for most sources. One of the most important features is also the ability to join data from multiple files on the fly.

    How would you update a big table, i.e. one having over 10 million rows?

    The most common way is to use batches: split one big query into smaller ones, for example 10 thousand rows per batch.
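
    A minimal sketch of a batched update, assuming an in-memory SQLite database, a hypothetical `accounts` table with an integer primary key, and a hypothetical helper `update_in_batches`. Each batch covers a key range and is committed on its own, which keeps transactions short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, 'old')", [(i,) for i in range(1, 25001)]
)
conn.commit()

def update_in_batches(conn, batch_size=10_000):
    """Update the table one key range at a time, committing after each batch."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM accounts").fetchone()[0]
    last_id = 0
    batches = 0
    while last_id < max_id:
        upper = last_id + batch_size
        conn.execute(
            "UPDATE accounts SET status = 'new' WHERE id > ? AND id <= ?",
            (last_id, upper),
        )
        conn.commit()  # one short transaction per batch
        last_id = upper
        batches += 1
    return batches

batches = update_in_batches(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE status = 'old'"
).fetchone()[0]
```

    Batching by key range means a batch holds fewer than `batch_size` rows when the ids have gaps; that is usually acceptable, since the goal is to bound the size of each transaction.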

    If the table is too big, a better option might be to create a new table, insert the changed data, and then switch tables.

    What are the disadvantages of indexes?

    Indexes allow fast lookups, but they decrease load performance: Heavily indexed tables will not allow effective DML operations, i.e. insertions and updates.

    It’s worth noting that indexes take additional disk space. Worse, though, the database back end needs to update all relevant indexes whenever data changes. It also creates additional overhead due to index fragmentation: Developers or DBAs have to take care of index maintenance, reorganization, and rebuilds.

    Index fragmentation causes serious performance issues. When new data is inserted into an index, the database engine has to find space for it. It might happen that the new data insert messes up the current order—the SQL engine might split the data from a single data page, which creates an excessive amount of free space (internal fragmentation). It might also mess up the current page order, which forces the SQL engine to jump between pages when reading data from the disk. All of this creates additional overhead to the process of reading the data and forces random disk I/O.

    Microsoft, in particular, recommends reorganizing an index when its fragmentation is between 5 and 30 percent, and rebuilding it when fragmentation is greater than 30 percent. An index rebuild in SQL Server creates another index underneath and then replaces the previous one. Rebuilding may block reads on the whole table (when using an edition other than Enterprise). Index reorganization is basically a reordering of leaf pages and an attempt to compact data pages.

    What’s better from a performance point of view: Filtering data first and then joining it with other sources, or joining it first and then filtering?

    It’s better to filter data first and then join it with other sources.

    A good way to improve ETL process performance is to get rid of unwanted data as soon as possible in the process. It reduces the time spent on data transfer and/or I/O and memory processing.

    The general rule is to reduce the number of processed rows and to avoid transforming data that never gets to the target.
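
    The effect can be illustrated with a small in-memory example. The `orders` and `customers` data and the `join` helper are hypothetical; the helper reports how many rows it had to process so the two orderings can be compared.

```python
# Hypothetical source data.
orders = [
    {"order_id": 1, "customer_id": 10, "status": "shipped"},
    {"order_id": 2, "customer_id": 11, "status": "cancelled"},
    {"order_id": 3, "customer_id": 10, "status": "cancelled"},
    {"order_id": 4, "customer_id": 12, "status": "shipped"},
]
customers = {10: "Ann", 11: "Bo", 12: "Cy"}

def join(rows):
    """Attach the customer name to each row; return the rows and the work done."""
    joined = [dict(r, customer=customers[r["customer_id"]]) for r in rows]
    return joined, len(rows)

# Join first, then filter: every order goes through the join.
joined_all, rows_joined_join_first = join(orders)
result_join_first = [r for r in joined_all if r["status"] == "shipped"]

# Filter first, then join: only the rows that reach the target are joined.
shipped = [r for r in orders if r["status"] == "shipped"]
result_filter_first, rows_joined_filter_first = join(shipped)
```

    Both orderings give the same result, but filtering first halves the number of rows joined in this example. A SQL optimizer may push such filters down on its own; in a pipeline built from separate steps, the order is typically the developer's responsibility.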

    How would you prepare logging for an ETL process?

    Logging is extremely important to keep track of all changes and failures during a load. The most common ways to prepare for logging are to use flat files or a logging table. That is, during the process, counts, timestamps, and metadata about the source and target are added and then dumped into a flat file or table.

    This way the load can be checked for invalid runs. When such a table or file exists, the next step would be to prepare notifications. This could be a report, or a simple formatted email, describing the load as soon as it finishes (e.g. the number of processed records compared to the previous load.)

    To achieve that in an ETL process, a developer would add event handlers (SSIS) or use variables (like the system variable @@ROWCOUNT in Transact-SQL) to keep track of inserted, updated, and deleted records.

    When using SSIS, we can also keep track of every processed package using the SSISDB database, which records package executions (for example, in its catalog.executions view).

    Other than flat files and the database itself, ETL tools offer native notification and logging features: for example, a dashboard indicating the current load status.
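
    A logging table can be sketched as follows, assuming an in-memory SQLite database; the `etl_log` and `target` tables and the `run_load` function are hypothetical. The cursor's `rowcount` plays the role that `@@ROWCOUNT` plays in Transact-SQL.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute(
    "CREATE TABLE etl_log ("
    " run_id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " started_at TEXT, finished_at TEXT, source TEXT,"
    " rows_inserted INTEGER, status TEXT)"
)

def run_load(conn, source_name, rows):
    """Load rows into the target and record the outcome in etl_log."""
    started = datetime.now(timezone.utc).isoformat()
    inserted, status = 0, "success"
    try:
        cur = conn.executemany("INSERT INTO target VALUES (?, ?)", rows)
        inserted = cur.rowcount  # total rows inserted by this call
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # undo the partial load
        inserted, status = 0, "failed"
    conn.execute(
        "INSERT INTO etl_log (started_at, finished_at, source, rows_inserted, status)"
        " VALUES (?, ?, ?, ?, ?)",
        (started, datetime.now(timezone.utc).isoformat(), source_name, inserted, status),
    )
    conn.commit()
    return status

first = run_load(conn, "file_a", [(1, "a"), (3, "c")])
second = run_load(conn, "file_b", [(2, "b"), (1, "dup")])  # duplicate key fails

log = conn.execute(
    "SELECT source, rows_inserted, status FROM etl_log ORDER BY run_id"
).fetchall()
target_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

    A notification step could then read the latest `etl_log` row and compare its counts with those of the previous run.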

    What is the purpose of data profiling in an ETL process? Which steps in the data profiling process are the most important?

    Data profiling tasks help maintain data quality. During this phase, a number of issues are checked for and resolved. The most important are:

  • Keys and unique identification of a row. Rows to be inserted must be unique. Often businesses use some natural keys to identify a given row, but developers must verify that this is sufficient.
  • Data types. Column names that suggest a certain type should be scrutinized: Will the indicated type change the meaning of the column, or potentially allow for data loss? Data types can also affect post-ETL performance: Even if it doesn’t matter much during the process, text loaded into a variable-length string column will, on some RDBMSes, cause a performance hit when users start querying the target.
  • Relationships among data. It’s important to know how tables relate to each other. It might require additional modeling to join some parts of data to avoid losing important structural information. Another thing is to understand the cardinality of a relationship, as it determines how the tables involved are going to be joined in the future.
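
    The first two checks (key uniqueness and data types) can be sketched in plain Python. The sample rows, the `profile` function, and the report layout are assumptions made for this illustration; checking relationships between tables is not covered here.

```python
from collections import Counter

# Hypothetical extracted rows; all values arrive as text or None.
rows = [
    {"customer_id": "1", "email": "a@example.com", "age": "34"},
    {"customer_id": "2", "email": None, "age": "n/a"},
    {"customer_id": "2", "email": "c@example.com", "age": "51"},
]

def profile(rows, key, numeric_columns):
    """Report duplicate keys, null counts, and values that are not integers."""
    key_counts = Counter(r[key] for r in rows)
    report = {
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        "null_counts": {c: sum(r[c] is None for r in rows) for c in rows[0]},
        "non_numeric": {},
    }
    for col in numeric_columns:
        report["non_numeric"][col] = [
            r[col]
            for r in rows
            if r[col] is not None and not r[col].strip().lstrip("-").isdigit()
        ]
    return report

report = profile(rows, key="customer_id", numeric_columns=["age"])
```
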
    What are three different approaches to implementing row versioning?

    Maintaining row history requires implementing a versioning policy. The three most popular types are:

  • Insert a new record: In this case, updated information about the row is stored, but it’s not linked to any other information—it’s treated as a new row. Usually, in this case, there is also an additional column (or even more than one) to easily identify the most recent change. It could be, for example, a “current record” flag, a “reason for change” text field, or a “valid from/until” pair of datetimes (or tsrange, perhaps).
  • Additional column(s): Here the old value of a changed column is moved to the additional column (e.g. old_amount) and the new value takes the place of the original (e.g. amount.)
  • History table: First a history table is created, separate from the primary table. Then we have multiple options for how to load data into this table. One of them is to create DML triggers. Functionality provided by RDBMS vendors—like change data capture features—can be handy here. Such features can be far more efficient than triggers, like when they keep track of changes directly in the transaction log, which is responsible for keeping information about any changes made to the database. SQL Server—in particular, 2016 and beyond—can track changes using system-versioned temporal tables. This feature maintains a full history table next to the most current one: The main temporal table keeps only the most recent version of the data, but it is linked to the history table, which contains all previous versions.
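
    The first approach (insert a new record, with a “current record” flag and a valid from/until pair) can be sketched as below. It assumes an in-memory SQLite database; the `customer_versions` table and the `upsert_version` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_versions ("
    " customer_id INTEGER, city TEXT,"
    " valid_from TEXT, valid_to TEXT, is_current INTEGER)"
)

def upsert_version(conn, customer_id, city, as_of):
    """Insert a new version when the city changed; return True if a row was added."""
    current = conn.execute(
        "SELECT city FROM customer_versions"
        " WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current and current[0] == city:
        return False  # nothing changed, keep the current version
    # Close the previous version, if any.
    conn.execute(
        "UPDATE customer_versions SET valid_to = ?, is_current = 0"
        " WHERE customer_id = ? AND is_current = 1",
        (as_of, customer_id),
    )
    # Insert the new version as the current record.
    conn.execute(
        "INSERT INTO customer_versions VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, as_of),
    )
    conn.commit()
    return True

added = [
    upsert_version(conn, 1, "Oslo", "2024-01-01"),
    upsert_version(conn, 1, "Oslo", "2024-02-01"),    # unchanged, no new row
    upsert_version(conn, 1, "Bergen", "2024-03-01"),
]
history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current"
    " FROM customer_versions ORDER BY valid_from"
).fetchall()
```
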
    There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science, and a lot of work.
