Q : What is Datastage?
A : DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.
Q : What is a conductor node in DataStage?
A : Every parallel job run involves a conductor process (the process where execution starts), a section leader process for each processing node, a player process for each set of combined operators, and an individual player process for each uncombined operator.
Whenever we need to kill a job's processes manually, we should stop the player processes first, then the section leader processes, and finally the conductor process.
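For example, a minimal command-line sketch of that order (the grep pattern and the placeholder PIDs are assumptions for illustration; in practice, prefer stopping the job from Director or with dsjob -stop):
$> ps -ef | grep osh            # list the conductor, section leader, and player processes (assumed pattern)
$> kill <player_pid>            # stop the player processes first
$> kill <section_leader_pid>    # then the section leader on each processing node
$> kill <conductor_pid>         # finally the conductor process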
Q : Explain the DataStage parallel Extender or Enterprise Edition (EE)?
A : The parallel extender in DataStage is the data extraction and transformation application for parallel processing.
There are two types of parallel processing available:
- Pipeline Parallelism
- Partition Parallelism
Q : How do you run datastage job from command line?
A : Use the “dsjob” command, as follows:
dsjob -run -jobstatus projectname jobname
Q : What are the different options associated with “dsjob” command?
A : For example, $dsjob -run. Other commonly used options include the following (see the usage sketch after this list):
-stop – To stop the running job
-lprojects – To list the projects
-ljobs – To list the jobs in a project
-lstages – To list the stages present in a job
-llinks – To list the links
-projectinfo – Returns the project information (host name and project name)
-jobinfo – Returns the job information (job status, job run time, end time, and so on)
-stageinfo – Returns the stage name, stage type, input rows, and so on
-linkinfo – Returns the link information
-lparams – To list the parameters in a job
-paraminfo – Returns the parameter information
-log – To add a text message to the log
-logsum – To display the log
-logdetail – To display the log with details such as event ID, time, and message
-lognewest – To display the newest log ID
-report – To display a report containing the generated time, start time, elapsed time, status, and so on
-jobid – Job ID information
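A hedged usage sketch that combines a few of these options (the project name dstage1 and job name Job_Load_Customers are hypothetical placeholders):
$> dsjob -run -jobstatus dstage1 Job_Load_Customers
$> dsjob -jobinfo dstage1 Job_Load_Customers
$> dsjob -logsum dstage1 Job_Load_Customers
The first command runs the job and waits so that the exit status reflects the job status; the second reports the job status and run times; the third displays the log summary for the latest run.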
Q : What is IBM DataStage Flow Designer?
A : IBM DataStage Flow Designer is a web-based user interface for DataStage. You can use it to create, edit, load, and run DataStage jobs.
Q : Can you explain the difference between a sequential file, a dataset, and a fileset?
A : Sequential File:
- Extracts/loads data from/to a sequential file with a maximum size of 2 GB
- When used as a source, the data is converted from ASCII to the native format at compile time
- Does not support null values
- A sequential file can be accessed on only one node.
Dataset:
- It preserves partitioning. It stores data on the nodes, so when you read from a dataset you do not have to repartition the data.
- It stores data in binary in the internal format of DataStage, so it takes less time to read/write from a dataset to any other source/target.
- You cannot view the data without DataStage.
- It creates two types of files to store the data:
- Descriptor file: created in the defined folder/path.
- Data file: created in the dataset folder mentioned in the configuration file.
- A dataset (.ds) file cannot be opened directly. To view or manage it, use the Data Set Management utility in the client tools (such as Designer and Manager) or the ORCHADMIN command-line utility (see the command-line sketch after this answer).
Fileset:
- It stores data in a format similar to that of a sequential file. The main advantage of using a fileset over a sequential file is that it preserves the partitioning scheme.
- You can view the data, but in the order defined by the partitioning scheme.
- A fileset creates a .fs file, which is stored in ASCII format, so you can open it directly to see the paths of the data files and their schema.
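A hedged command-line sketch of inspecting these files (the file names are hypothetical, and the exact orchadmin sub-commands and options may differ by version):
$> cat customers.fs                  # a .fs descriptor is plain ASCII, so it can be viewed directly
$> orchadmin describe customers.ds   # show the dataset schema and partition layout (assumed sub-command)
$> orchadmin dump customers.ds       # print the records stored in the dataset (assumed sub-command)
$> orchadmin rm customers.ds         # remove the descriptor and its underlying data files (assumed sub-command)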
Q : What are the features of DataStage Flow Designer?
A : IBM DataStage Flow Designer has many features to enhance your job-building experience.
We can use the palette to drag and drop connectors and operators onto the designer canvas.
We can link nodes by selecting the previous node and dropping the next node, or by drawing the link between the two nodes.
We can edit stage properties in the side bar and make changes to the schema in the Column Properties tab.
We can zoom in and out using the mouse, and leverage the mini-map in the lower right of the window to focus on a particular part of the DataStage job.
This is very useful when you have a very large job with tens or hundreds of stages.
Q : What are the benefits of Flow Designer?
A : There are many benefits to using Flow Designer:
No need to migrate jobs – You do not need to migrate jobs to a new location in order to use the new web-based IBM DataStage Flow Designer user interface.
No need to upgrade servers and purchase virtualization technology licenses – Getting rid of a thick client means getting rid of keeping up with the latest version of software, upgrading servers, and purchasing Citrix licenses. IBM DataStage Flow Designer saves time AND money!
Easily work with your favorite jobs – You can mark your favorite jobs in the Jobs Dashboard and have them automatically show up on the welcome page. This gives you fast, one-click access to jobs that are typically used for reference, saving you navigation time.
Easily continue working where you left off – Your recent activity automatically shows up on the welcome page. This gives you fast, one-click access to jobs that you were working on before, so you can easily start where you left off in the last session.
Efficiently search any job – Many organizations have thousands of DataStage jobs. You can very easily find your job with the built-in type ahead Search feature on the Jobs Dashboard.
Cloning a job – Instead of always starting Job Design from scratch, you can clone an existing job on the Jobs Dashboard and use that to jump-start your new Job Design.
Automatic metadata propagation – IBM DataStage Flow Designer comes with a powerful feature to automatically propagate metadata. Once you add a source connector to your job and link it to an operator, the operator automatically inherits the metadata. You do not have to specify the metadata in each stage of the job.
Storing your preferences – You can easily customize your viewing preferences and have the IBM DataStage Flow Designer automatically save them across sessions.
Saving a job – IBM DataStage Flow Designer allows you to save a job in any folder. The job is saved as a DataStage job in the repository, alongside other jobs that might have been created using the DataStage Designer thick client.
Highlighting of all compilation errors – The DataStage thick client identifies compilation errors one at a time. Large jobs with many stages can take longer to troubleshoot in this situation. IBM DataStage Flow Designer highlights all errors and gives you a way to see the problem with a quick hover over each stage, so you can fix multiple problems at the same time before recompiling.
Running a job – IBM DataStage Flow Designer allows you to run a job. You can refresh the status of your job on the new user interface. You can also view the Job Log, or launch the Ops Console to see more details of job execution.
Q : What is Hive connector?
A : Hive connector supports modulus partition mode and minimum maximum partition mode during the read operation.
Q : What is HBase connector?
A : HBase connector is used to connect to tables stored in the HBase database and perform the following operations:
- Read data from or write data to the HBase database.
- Read data in parallel mode.
- Use HBase table as a lookup table in sparse or normal mode.
Q : What is Kafka connector?
A : Kafka connector has been enhanced with the following new capabilities:
- Continuous mode, where incoming topic messages are consumed without stopping the connector.
- Transactions, where a number of Kafka messages is fetched within a single transaction. After the record count is reached, an end-of-wave marker is sent to the output link.
- TLS connection to Kafka.
- Kerberos keytab locality is supported.
Q : What is File connector?
A : File connector has been enhanced with the following new capabilities:
- Native HDFS FileSystem mode is supported.
- You can import metadata from the ORC files.
- New data types are supported for reading and writing Parquet-formatted files: Date/Time and Timestamp.
Q : What is Amazon S3 connector?
A : Amazon S3 connector now supports connecting by using an HTTP proxy server.
Q : What is InfoSphere Information Server?
A : InfoSphere Information Server is capable of scaling to meet any information volume requirement so that companies can deliver business results faster and with higher quality results. InfoSphere Information Server provides a single unified platform that enables companies to understand, cleanse, transform, and deliver trustworthy and context-rich information.
Q : What is Client tier in Information server?
A : The client tier includes the client programs and consoles that are used for development and administration, and the computers where they are installed.
Q : What are the different Tiers available in InfoSphere Information Server?
A : In InfoSphere Information Server, there are four tiers:
- Client Tier
- Engine Tier
- Services Tier
- Metadata Repository Tier
Q : What is Engine tier in Information server?
A : The engine tier includes the logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.
Q : What is Services tier in Information server?
A : The services tier includes the application server, common services, and product services for the suite and product modules, and the computer where those components are installed. The services tier provides common services (such as metadata and logging) and services that are specific to certain product modules. On the services tier, WebSphere® Application Server hosts the services. The services tier also hosts InfoSphere Information Server applications that are web-based.
Q : What is the Metadata repository tier in Information server?
A : The metadata repository tier includes the metadata repository, the InfoSphere Information Analyzer analysis database (if installed), and the computer where these components are installed. The metadata repository contains the shared metadata, data, and configuration information for InfoSphere Information Server product modules. The analysis database stores extended analysis data for InfoSphere Information Analyzer.
Q : What are the key elements of Datastage?
A : DataStage provides the elements that are necessary to build data integration and transformation flows.
These elements include
- Stages
- Links
- Jobs
- Table definitions
- Containers
- Sequence jobs
- Projects
Q : What are Links in Datastage?
A : A link is a representation of a data flow that joins the stages in a job. A link connects data sources to processing stages, connects processing stages to each other, and also connects those processing stages to target systems. Links are like pipes through which the data flows from one stage to the next.
Q : What are Stages in Datastage?
A : Stages are the basic building blocks in InfoSphere DataStage, providing a rich, unique set of functionality that performs either a simple or advanced data integration task. Stages represent the processing steps that will be performed on the data.
Q : What are Jobs in Datastage?
A : Jobs include the design objects and compiled programmatic elements that can connect to data sources, extract and transform that data, and then load that data into a target system. Jobs are created within a visual paradigm that enables instant understanding of the goal of the job.
Q : What are Sequence jobs in Datastage?
A : A sequence job is a special type of job that you can use to create a workflow by running other jobs in a specified order. This type of job was previously called a job sequence.
Q : What are Table definitions?
A : Table definitions specify the format of the data that you want to use at each stage of a job. They can be shared by all the jobs in a project and between all projects in InfoSphere DataStage. Typically, table definitions are loaded into source stages. They are sometimes loaded into target stages and other stages.
Q : What are Projects in Datastage?
A : A project is a container that organizes and provides security for objects that are supplied, created, or maintained for data integration, data profiling, quality monitoring, and so on.
Q : What are Containers in Datastage?
A : Containers are reusable objects that hold user-defined groupings of stages and links. Containers create a level of reuse that allows you to use the same set of logic several times while reducing the maintenance. Containers make it easy to share a workflow, because you can simplify and modularize your job designs by replacing complex areas of the diagram with a single container.
Q : What is Parallel processing design?
A : InfoSphere DataStage brings the power of parallel processing to the data extraction and transformation process. InfoSphere DataStage jobs automatically inherit the capabilities of data pipelining and data partitioning, allowing you to design an integration process without concern for data volumes or time constraints, and without any requirements for hand coding.
Q : What are the types of parallel processing?
A : InfoSphere DataStage jobs use two types of parallel processing:
- Data pipelining
- Data partitioning
Q : What is Data pipelining?
A : Data pipelining is the process of extracting records from the data source system and moving them through the sequence of processing functions that are defined in the data flow that is defined by the job. Because records are flowing through the pipeline, they can be processed without writing the records to disk.
Q : What is Data partitioning?
A : Data partitioning is an approach to parallelism that involves breaking the records into partitions, or subsets of records. Data partitioning generally provides linear increases in application performance.
When you design a job, you select the type of data partitioning algorithm that you want to use (hash, range, modulus, and so on). Then, at run time, InfoSphere DataStage applies that selection across the degree of parallelism, which is specified dynamically through the configuration file.
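A hedged sketch of what such a configuration file looks like (the host name and directory paths here are hypothetical; the active file is normally pointed to by the APT_CONFIG_FILE environment variable):
$> cat $APT_CONFIG_FILE
{
  node "node1"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets/node1" {pools ""}
    resource scratchdisk "/data/scratch/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets/node2" {pools ""}
    resource scratchdisk "/data/scratch/node2" {pools ""}
  }
}
A two-node file like this gives the job a degree of parallelism of two without any change to the job design.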
Q : What are Operators in Datastage?
A : A single stage might correspond to a single operator, or a number of operators, depending on the properties you have set, and whether you have chosen to partition or collect or sort data on the input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
Q : What are Players in Datastage?
A : Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).
Q : What is OSH in Datastage?
A : OSH is the scripting language used internally by the parallel engine.
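For illustration only, a tiny OSH pipeline can be run from the shell in roughly this form (the generator and peek operators exist in the parallel engine, but the exact option syntax shown here is an assumption rather than output from a compiled job):
$> osh "generator -schema record(a:int32;) | peek"    # assumed syntax: generate rows and print them to the job log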
Q : What are the two major ways of combining data in an InfoSphere DataStage Job? How do you decide which one to use?
A : The two major ways of combining data in an InfoSphere DataStage job are via a Lookup stage or a Join stage.
- Lookup and Join stages perform equivalent operations: combining two or more input data sets based on one or more specified keys. When one unsorted input is very large or sorting is not feasible, Lookup is preferred. When all inputs are of manageable size or are pre-sorted, Join is the preferred solution.
- The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit into physical memory.
Q : What is Link buffering?
A : InfoSphere DataStage automatically performs buffering on the links of certain stages. This is primarily intended to prevent deadlock situations arising (where one stage is unable to read its input because a previous stage in the job is blocked from writing to its output).
Q : What is the advantage of using Modular development in data stage?
A : We should aim to use modular development techniques in our job designs in order to maximize the reuse of parallel jobs and components and save time.
Q : How do you import and export data into Datastage?
A : Data is imported to and exported from DataStage by using the import/export utility, which consists of these operators:
- The import operator: imports one or more data files into a single data set.
- The export operator: exports a data set to one or more data files.
Q : What is the collection library in Datastage?
A : The collection library is a set of related operators that are concerned with collecting partitioned data.
Q : What are the collectors available in collection library?
A : The collection library contains three collectors:
- The ordered collector
- The roundrobin collector
- The sortmerge collector
Q : What is the ordered collector?
A : The Ordered collector reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves the sorted order of an input data set that has been totally sorted. In a totally sorted data set, the records in each partition of the data set, as well as the partitions themselves, are ordered.
Q : What is the roundrobin collector?
A : The roundrobin collector reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the collector starts over. After reaching the final record in any partition, the collector skips that partition.
Q : What is the sortmerge collector?
A : The sortmerge collector reads records in an order based on one or more fields of the record. The fields used to define record order are called collecting keys.
Q : What is the aggtorec restructure operator and what does it do?
A : The aggtorec restructure operator groups records that have the same key-field values into an output record.
Q : What is the field_export restructure operator and what does it do?
A : The field_export restructure operator combines the input fields specified in your output schema into a string- or raw-valued field.
Q : What is the field_import restructure operator and what does it do?
A : The field_import restructure operator exports an input string or raw field to the output fields specified in your import schema.
Q : What is the makesubrec restructure operator and what does it do?
A : The makesubrec restructure operator combines specified vector fields into a vector of subrecords.
Q : What is the makevect restructure operator and what does it do?
A : The makevect restructure operator combines specified fields into a vector of fields of the same type.
Q : What is the promotesubrec restructure operator and what does it do?
A : The promotesubrec restructure operator converts input subrecord fields to output top-level fields.
Q : What is the splitvect restructure operator and what does it do?
A : The splitvect restructure operator promotes the elements of a fixed-length vector to a set of similarly-named top-level fields.
Q : What is the splitsubrec restructure operator and what does it do?
A : The splitsubrec restructure operator separates input subrecords into sets of output top-level vector fields.
Q : What is the tagbatch restructure operator and what does it do?
A : The tagbatch restructure operator converts tagged fields into output records whose schema supports all the possible fields of the tag cases.
Q : What is the tagswitch restructure operator and what does it do?
A : The tagswitch restructure operator converts the contents of tagged aggregates into InfoSphere DataStage-compatible records.
Q : How do you print/display the last line of a file?
A : The easiest way is to use the [tail] command.
$> tail -1 file.txt
If you want to do it using [sed] command, here is what you should write:
$> sed -n '$ p' file.txt
Q : How do you print/display the first line of a file?
A : The easiest way to display the first line of a file is using the [head] command.
$> head -1 file.txt
If you specify [head -2], then it would print the first 2 lines of the file.
Another way is by using the [sed] command. [Sed] is a very powerful stream editor which can be used for various text manipulation purposes like this.
$> sed '2,$ d' file.txt
Q : How to display n-th line of a file?
A : The easiest way to do it is by using the [sed] command:
$> sed -n '<n> p' file.txt
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be:
$> sed -n '4 p' file.txt
Of course, you can do it by using the [head] and [tail] commands as well, like below:
$> head -<n> file.txt | tail -1
You need to replace <n> with the actual line number. So if you want to print the 4th line, the command will be:
$> head -4 file.txt | tail -1
Q : How to remove the first line / header from a file?
A : We already know how [sed] can be used to delete a certain line from the output – by using the 'd' switch. So if we want to delete the first line, the command should be:
$> sed '1 d' file.txt
But the issue with the above command is, it just prints out all the lines except the first line of the file on the standard output. It does not really change the file in-place. So if you want to delete the first line from the file itself, you have two options.
Either you can redirect the output of the file to some other file and then rename it back to original file like below:
$> sed '1 d' file.txt > new_file.txt
$> mv new_file.txt file.txt
Or, you can use the inbuilt [sed] switch '-i' which changes the file in-place. See below:
$> sed -i '1 d' file.txt
Q : How to remove certain lines from a file in Unix?
A : If you want to remove lines m to n from a given file, you can accomplish the task in a similar way to the method shown above. Here is an example:
$> sed -i '5,7 d' file.txt
The above command will delete line 5 to line 7 from the file file.txt
Q : How to remove the last line/ trailer from a file in Unix script?
A : Always remember that the [sed] switch '$' refers to the last line. So using this knowledge we can deduce the below command:
$> sed -i '$ d' file.txt