#builddata
AWS Glue Studio: Build Data Pipelines Without Writing Code

With AWS Glue Studio, you can integrate your data and collaborate with colleagues using its built-in data preparation tools.
What is AWS Glue Studio?
AWS Glue Studio is a visual interface within the AWS Glue service that helps data scientists, engineers, and developers create, run, and monitor ETL (Extract, Transform, Load) jobs. AWS Glue itself is a fully managed ETL service that makes it easy to prepare and load data for analytics.
AWS Glue Studio tutorial
AWS has announced the general availability of data preparation authoring in AWS Glue Studio Visual ETL. This new no-code data preparation tool gives business users and data analysts a spreadsheet-style interface and runs data integration jobs at scale on AWS Glue for Spark. The new visual data preparation experience makes it easier for data analysts and data scientists to clean and transform data in preparation for analytics and machine learning (ML): you can choose from hundreds of prebuilt transforms to automate data preparation tasks without writing any code.
An AWS Glue job contains a script that connects to your source data, processes it, and then writes it to your data target. A job typically runs extract, transform, and load (ETL) scripts; jobs can also run scripts written for the Apache Spark and Ray runtime environments, as well as general-purpose Python scripts (Python shell jobs). AWS Glue triggers can start jobs on demand, on a schedule, or in response to an event, and you can monitor job runs to understand runtime metrics such as completion status, duration, and start time.
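To make that structure concrete, here is a minimal sketch in the style of the PySpark scripts AWS Glue generates. It runs inside the Glue job environment (where the awsglue library is available); the bucket paths and the DropNullFields transform are illustrative assumptions, not details from this post.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue boilerplate: resolve the job name and initialize contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read CSV data from a source bucket (hypothetical path).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-source-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Transform: drop fields that are null in every record.
cleaned = DropNullFields.apply(frame=source)

# Load: write the result to a target bucket as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-target-bucket/output/"},
    format="parquet",
)

job.commit()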
You can use the scripts that AWS Glue generates or provide your own. Given a source schema and a target location or schema, the AWS Glue Studio code generator can automatically create an Apache Spark API (PySpark) script. You can use that script as a starting point and edit it to meet your needs.
AWS Glue can write output files in multiple data formats. Each job type may support different output formats, and common compression formats can be configured for certain data formats.
Signing in to the AWS Glue console
In AWS Glue, a job is the business logic that performs the extract, transform, and load (ETL) work. You create jobs in the ETL section of the AWS Glue console.
Sign in to the AWS Management Console and open the AWS Glue console to review the jobs that already exist, then choose the Jobs tab. The Jobs list shows the script location, the last modification date, and the current job bookmark option for each job.
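The same information is available from the AWS Glue API if you prefer scripting to the console. A small boto3 sketch, assuming your credentials are already configured:

import boto3

# List Glue jobs and print the fields the console's Jobs list displays.
glue = boto3.client("glue")
for page in glue.get_paginator("get_jobs").paginate():
    for job in page["Jobs"]:
        print(job["Name"], job.get("LastModifiedOn"), job["Command"].get("ScriptLocation"))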
You can use AWS Glue Studio to edit your ETL jobs while creating a new job or after saving one, either by editing the nodes in the visual editor or by editing the job script in developer mode. The visual editor also lets you add and remove nodes to design more complex ETL jobs.
Steps to create a job in AWS Glue Studio
You configure the nodes for your job in the visual job editor. Each node represents an action, such as reading data from the source or transforming it, and every node you add to your job has properties that describe the transform or the location of the data.
Data engineers and business analysts can now collaborate on data integration projects: engineers define connections to the data and configure the ordering of the data flow in Glue Studio's visual, flow-based interface, while analysts bring their data preparation expertise to define the transformations and the output. You can also import your existing data cleansing and preparation "recipes" from AWS Glue DataBrew into the new AWS Glue data preparation experience, keep authoring them directly in AWS Glue Studio, and then scale them up to process petabytes of data at a fraction of the cost.
Prerequisites for Visual ETL
Visual ETL requires the AWSGlueConsoleFullAccess IAM managed policy to be attached to the users and roles that will access AWS Glue. This policy gives those users and roles full access to AWS Glue and read access to Amazon Simple Storage Service (Amazon S3) resources.
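The policy can be attached in the IAM console or programmatically. A boto3 sketch, where the user and role names are placeholders:

import boto3

iam = boto3.client("iam")

# Attach the managed policy to a user (hypothetical user name).
iam.attach_user_policy(
    UserName="glue-studio-analyst",
    PolicyArn="arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
)

# Roles are handled the same way (hypothetical role name).
iam.attach_role_policy(
    RoleName="GlueStudioRole",
    PolicyArn="arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
)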
Sophisticated visual ETL flows
With the necessary AWS Identity and Access Management (IAM) permissions in place, use AWS Glue Studio to author the visual ETL flow.
Extract
To prepare the source data, I first created an S3 bucket in the same Region as the AWS Glue visual ETL job and uploaded a .csv file called visual ETL conference data.csv. Then choose the Amazon S3 node from the list of Sources to create an Amazon S3 source node, select the newly created node, and browse to the S3 dataset. Once the file is in place, select Infer schema to configure the source node; a preview of the data in the .csv file appears in the visual interface.
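Creating the bucket and uploading the file can also be scripted. A boto3 sketch, where the bucket name and Region are assumptions:

import boto3

# Assumption: the Glue visual ETL job lives in us-east-1.
s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket (Regions other than us-east-1 also need a
# CreateBucketConfiguration with a LocationConstraint).
s3.create_bucket(Bucket="example-visual-etl-bucket")

# Upload the CSV dataset that the S3 source node will read.
s3.upload_file(
    "visual ETL conference data.csv",
    "example-visual-etl-bucket",
    "input/visual ETL conference data.csv",
)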
Transform
Once the node is configured, add a Data Preparation Recipe and start a data preview session. The session usually takes two to three minutes to begin.
When the data preview session is ready and the data frame has loaded, select Author Recipe to begin an authoring session and add transformations. During the authoring session you can inspect the data, apply transformation steps, and see the modified data interactively; steps can be undone, redone, and reordered, and each column's data type and statistical properties are visible.
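Since the post mentions importing AWS Glue DataBrew recipes, here is a rough sketch of saving such a recipe via boto3. The recipe name, column names, and even the exact operation names are assumptions; check the DataBrew recipe actions reference for the operations you actually need.

import boto3

databrew = boto3.client("databrew")

# Hypothetical two-step recipe: upper-case one column and drop rows
# where another column is missing.
databrew.create_recipe(
    Name="conference-data-prep",
    Steps=[
        {"Action": {"Operation": "UPPER_CASE",
                    "Parameters": {"sourceColumn": "city"}}},
        {"Action": {"Operation": "REMOVE_MISSING",
                    "Parameters": {"sourceColumn": "attendees"}}},
    ],
)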
Load
After you have interactively prepared your data, you can share your work with data engineers, who can add custom code and more sophisticated visual ETL flows to incorporate it into their production data pipelines.
Now available
The AWS Glue data preparation authoring experience is now generally available in all commercial AWS Regions where AWS Glue DataBrew is offered. Visit AWS Glue to learn more.
Read more on govindhtech.com


MKWS 316L Stainless Steel Alien Heating Wire for RBA Atomizer 316L SS (alien), 32 AWG, 0.3*0.8mm, 5m, 3ohm
BuildData: 3 mm diameter / 6 wraps / 0.32 Ω
Settings: Wismec Predator 228 / ArcticFox 170914 / GeekVape Ammit 25 / TC PI-REG P:400 I:300
Why Chinese Characters in Code Can Block CIA Hackers
New Post has been published on https://seo.techroomage.com/%e4%bb%a3%e7%a2%bc%e4%b8%ad%e7%9a%84%e6%bc%a2%e5%ad%97-%e7%82%ba%e4%bb%80%e9%ba%bc%e8%83%bd%e6%93%8b%e4%bd%8fcia%e9%bb%91%e5%ae%a2/
Not long ago, WikiLeaks revealed that the CIA used malware and other cyber weapons to take control of electronic devices and operating system products from companies in the United States, Europe, and elsewhere, including Apple iPhones, Google's Android system, Microsoft Windows, and Samsung smart TVs, turning them into microphones for eavesdropping and transmitting the recordings to CIA servers.
In addition, one of the documents WikiLeaks released shows that agents assigned to China operations were badly hampered by the language barrier. Reference News (《參考消息》) even ran a story under the headline "WikiLeaks Reveals Chinese Characters in Code Blocked CIA Hackers". So what actually happened when Chinese blocked the CIA's hackers?
Chinese is not a long-term defense against CIA hackers
Although the Reference News story ran under that headline, with its implication that Chinese left the CIA's hackers helpless and unable to steal China's secrets, the reality is simply that the CIA's hackers could not read Chinese. Chinese text or Chinese comments in source code come nowhere near guaranteeing information security, and they are no long-term solution. The vast majority of program code is written in general-purpose programming languages, and programmers everywhere can recognize code built from those English keywords. But recognizing code is one thing; fully interpreting it is another. Source code tends to be opaque, and uncommented code is genuinely hard to understand for anyone other than the engineer who wrote the program.
Documents recently published by WikiLeaks show that the CIA attacked electronic devices in China and many other countries by a variety of means. China's Foreign Ministry responded by urging the US to stop the attacks.
Some Chinese companies have formed joint ventures with foreign firms to build CPUs, bought licenses to build SoCs, or set up joint ventures for the so-called government edition of Windows 10. Even though they licensed the technology and may have received part of the source code, the design documents and the comments generally cannot be obtained from the foreign partner. This is one reason why, for both CPUs and operating systems, some companies that received enormous special funding during the 12th Five-Year Plan have spent years and vast sums and still end up dressing foreign technology in a domestic wrapper.
What this WikiLeaks disclosure actually describes is CIA hackers who obtained source code through special means and then discovered they could not read it because it had no English comments.
Many formally trained programmers dutifully comment their code in English. But some Chinese software engineers have weak English, and for some it is very limited indeed; commenting in English would invite mistakes, so they write their source code comments in Chinese.
With limited English, commenting in English rather than Chinese risks mistranslations, misspellings, and obscure technical vocabulary. Take "creation time": it is usually rendered CreateTime, but some people write BuildDate, and sometimes a slip of the keyboard turns that into BuildData. That case is still recognizable as a typo.
More often, a misspelled English word makes an identifier dramatically harder for other programmers to decode, and no amount of guessing recovers the meaning. Writing in Chinese is not only easier for domestic colleagues to understand; it also produces fewer ambiguities.
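A hypothetical sketch of that naming drift; none of these identifiers come from any real codebase:

# Three names one might meet for the same concept, as described above.
create_time = "2017-03-09T12:00:00"  # the conventional rendering
BuildDate = "2017-03-09T12:00:00"    # a different author's choice
BuildData = "2017-03-09T12:00:00"    # a typo of BuildDate that now reads
                                     # as "build data", a different concept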
There is also the problem that specialist English terminology is obscure, and people outside a given field simply do not know the English words for its terms of art. In the electric power domain, for instance, the programmers are coders, not power engineers, so when Chinese software engineers take on domestic projects they often have no idea what the relevant technical terms are in English. In that situation they fall back on Chinese or pinyin.
One software engineer told me: "When we built a system for the XX power grid, it contained over a thousand power-industry terms. If we had used English we could have gone and banged our heads against a wall... so the comments were generally in Chinese, and the program variable names were pinyin." Many large Chinese companies do in fact use pinyin this way.
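A hypothetical sketch of that style, with pinyin identifiers and Chinese comments (dianya is voltage, dianliu is current):

def jisuan_gonglv(dianya, dianliu):
    """计算功率 (compute power)"""
    # 功率 = 电压 × 电流 (power = voltage * current)
    return dianya * dianliu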
The real reason the CIA's hackers were stopped by Chinese is that the language is vast and subtle, and Chinese programmers write their Chinese comments quite loosely. Without a solid grounding in Chinese, a CIA hacker struggles to understand those comments, which is how they ended up holding source code they could not read.
As the CIA recruits hackers who know Chinese, however, the problem of unreadable Chinese in source code will disappear.
WikiLeaks says the CIA built the "home base" for its European hackers inside the US Consulate in Frankfurt, Germany.
To forge iron, one must be strong oneself
According to Edward Snowden's disclosures, the US government has long used every available means to conduct surveillance and cyber attacks on many countries. Beyond the traditional route of attacking servers and PCs to obtain other countries' secrets, the rise of the Internet of Things and smart hardware of every kind has made network security, and defense against network attacks, dramatically harder.
Smart hardware is growing explosively. In 2016 roughly 170 million people bought IoT devices as gifts, and by 2020 the number of connected smart devices worldwide is expected to reach 50 billion. The coming 5G era will connect everything: smart air conditioners, smart TVs, smart washing machines, and other appliances will link up with phones, other personal smart devices, and PCs, while cameras and surveillance gear, smart speakers, automotive electronics, medical devices, industrial equipment, and other smart hardware will all be networked too. Every one of these devices has a CPU, memory, and an operating system. However strange they look, each is really a miniature computer.
More dangerous still, a large share of these devices is practically undefended. Their software goes unmaintained for years and accumulates vulnerabilities. Many smart devices place few demands on CPU performance but are extremely sensitive to power consumption and cost, so they tend to use nearly obsolete chips. Granted, those old chips have the advantage of years of use and validation; but since the difficulty of cracking a chip scales with its complexity, and attackers have had ample time to work on them, their security can be flawed. On top of that, because Western technology companies often cooperate with their home governments in various ways, the internet and the smart hardware of many countries lie fully exposed to state-level hackers.
For example, WikiLeaks published documents showing the CIA using a range of techniques to attack Windows, iOS, Android, and other operating systems on computers and phones, and to operate smart TVs, smart surveillance devices, and other endpoints for espionage. Most chilling of all, the CIA could even remotely control smart cars to carry out assassinations.
Achieving network security and protecting state secrets and personal privacy therefore requires technical measures, not a few Chinese characters sprinkled into source code.
Those technical measures chiefly have to deal with foreign state-level attackers and two kinds of knife attack: the stab from the front and the stab in the back.
The stab in the back is when the network equipment, computers, servers, wearables, phones, and other products people use have had backdoors planted in their software and hardware by foreign technology companies, through which state-level attackers can freely steal state secrets and personal privacy.
The stab from the front is a direct attack by foreign state-level hackers. Facing it requires hardening both software and hardware to prevent unauthorized access and modification.
The best defense against the stab in the back is software and hardware that are domestically developed, secure, and trustworthy: CPUs designed, fabricated, and packaged domestically, and home-grown operating systems replacing foreign products. This is also the precondition for resisting the stab from the front, because without mastering the core technology yourself you can only use whatever foreign vendors choose to sell you. You then cannot design a security architecture holistically, and it becomes very hard to improve your defenses against hacker attacks.
Firmware matters greatly as well. Firmware is the layer between the hardware and the operating system; its basic job is to configure the hardware and boot the system. The BIOS is one of the most important pieces of firmware in a computer, and the NSA, the CIA, and Hacking Team have all been fond of planting trojans through it. Servers also carry BMC firmware for monitoring and managing the machine. On the defensive side, because the BIOS runs before the operating system, it can check, before handing over control, whether the system about to run matches expectations or has been tampered with. By passing a relationship of trust along to the operating system, the BIOS can stop malicious code from executing. Firmware therefore holds a very important place in security.
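As a purely conceptual sketch of that chain-of-trust idea (real verification happens in firmware, not in Python, and the digest below is a placeholder):

import hashlib

# Placeholder for a known-good digest recorded when the image was trusted.
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

def boot_image_is_trusted(path: str) -> bool:
    """Hash the OS image and compare it against the known-good digest."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest() == EXPECTED_SHA256

# Hand control to the OS only if verification passes.
if boot_image_is_trusted("/boot/os-image.bin"):
    print("image verified; continuing boot")
else:
    print("image modified; halting")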
In short, Chinese in source code is merely a small obstacle, not an insurmountable technical barrier. As for programming in Chinese, it is simply unnecessary. It might help Chinese speakers who do not know English, but English is the de facto world language, and a programmer who cannot read it cannot communicate with peers in other countries; that road most likely ends in working behind closed doors, and the losses outweigh the gains. The real way to improve defenses against network attacks is to replace foreign products with independently developed ones and to keep using, refining, and testing them in practice, improving in an upward spiral.
Please cite the source when reposting: Why Chinese Characters in Code Can Block CIA Hackers - Whoops SEO 搜尋引擎優化 – Search Engine Optimization