I am resurrecting this very old blog to talk about my journey through computer science: operating systems, data science, distributed computing, and things I have not yet thought about.
Application Security : CSRF
Cross-Site Request Forgery (CSRF) allows an attacker to perform actions or modify information in an app you are logged in to by exploiting your authentication cookies.
First thing to know: use HTTP methods carefully. For instance, GET should be a safe method with no side effects. Otherwise, simply opening an email or loading a page can trigger the exploit of an app vulnerability.
PortSwigger has a nice set of labs to understand CSRF vulnerabilities: https://portswigger.net/web-security/csrf
Use of CSRF protections in web frameworks
Nuxt
Based on express-csurf. I am not sure about its potential vulnerabilities. The token is sent in a header, and the secret used to validate the token is stored in a cookie.
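To make the header token / cookie secret split concrete for myself, here is a minimal Python sketch. It is not csurf's actual algorithm: the HMAC construction and the names new_secret, make_token and verify_token are my own assumptions, chosen only to show how a server can check a token against a secret without storing the token.

```python
import hashlib
import hmac
import secrets


def new_secret() -> str:
    """Per-session secret; in the scheme above it would be kept in a cookie."""
    return secrets.token_urlsafe(32)


def make_token(secret: str) -> str:
    """Derive a token '<salt>.<HMAC-SHA256(secret, salt)>' (hypothetical format)."""
    salt = secrets.token_urlsafe(8)
    digest = hmac.new(secret.encode(), salt.encode(), hashlib.sha256).hexdigest()
    return f"{salt}.{digest}"


def verify_token(secret: str, token: str) -> bool:
    """Recompute the HMAC from the cookie secret and compare in constant time."""
    salt, sep, digest = token.partition(".")
    if not sep:
        return False  # malformed token: no salt/digest separator
    expected = hmac.new(secret.encode(), salt.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest.encode(), expected.encode())
```

The page would send make_token(secret) back in a request header, and the server would call verify_token with the secret read from the cookie. A forged cross-site request carries the cookie automatically but cannot set the header, so the check fails.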
Django
Data engineering on GCP
This is a practical path to data engineering. As a result, we will not start with the fundamentals of distributed systems and data storage but will directly use cloud-based tools to understand how to build data pipelines.
We will mainly use Google Cloud Skills Boost as the source of material.
Learning path :
Apache Beam : motivations for this batch and streaming programming model https://youtu.be/owTuuVt6Oro
The Pony programming language
Pony seems to deal with some concepts I am interested in learning about: actor systems and distributed scheduling. I will try to answer 2 questions:
- why was this language created
- what are actor systems
The early history of Pony
This blog post gives a fair overview of why the language exists: https://www.ponylang.io/blog/2017/05/an-early-history-of-pony/
When working in low-latency, high-throughput environments, exchanging data is a challenge. You need a system able to deal with high concurrency. It needs to be very performant: compiled, but with shared use of data.
How to avoid counting references to the same object (slow)? How to avoid making too many copies of the same object (same issue)?
Is C/C++ the right language for this? Apparently not. Why? Data races and deadlocks. They are difficult to debug, which means a loss of productivity for developers.
Is the actor system a solution to this ? Maybe.
Pony is...
statically typed
compiled
highly concurrent
based on the actor model
object oriented
based on composition rather than inheritance
Pony has...
a garbage collector (what are the implications for performance?)
traits
interfaces
What is this thing ?
Primitive: like a class, but immutable and a singleton, so it works well as an enum or as a collection of functions. Primitives can define the special functions _init and _final.
Behaviour: a behaviour returns None, the way an async function would return a Promise, since the computation has not been done yet.
Fun stuff
Pony does not have a null.
It also does not have global variables.
Pony assignments are expressions (not statements), so they return a value. Funnily enough, they return the old value from before the assignment rather than the new one.
There are no locks so no deadlocks.
Actor system
Examples of the actor model in practice? Erlang, Elixir, or Akka (a toolkit for the JVM rather than a language).
Actors communicate with each other through message passing.
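To get a feel for message passing, here is a rough Python sketch of a single actor. This is not how Pony schedules actors: the CounterActor class, its thread-per-actor design and the None stop signal are my own assumptions for illustration. The point is only that the state is touched by one thread, and everyone else talks to it through a mailbox.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, a mailbox, and one thread draining the mailbox."""

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self.total = 0  # only ever modified by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message) -> None:
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self) -> None:
        """Send the stop signal (None) and wait until earlier messages are handled."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self) -> None:
        # Messages are handled one at a time, in arrival order, so no lock is needed.
        while True:
            message = self._mailbox.get()
            if message is None:
                break
            self.total += message
```

Sending 1, 2 and 3 with send() and then calling stop() leaves total at 6. The caller never waits for an addition to happen, which is the same flavour as a behaviour returning nothing useful right away.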
Hypothesis: learn Pony to learn Rust later? Yea or nay.
Question: asynchronicity allows concurrency, but what are the other ways to achieve concurrency?
Search for the web
https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html
I want to understand the inner workings of Elasticsearch. Starting from a book looks more interesting than reading documentation: "We wrote this book because Elasticsearch needs a narrative".
What is Elasticsearch
A document-oriented database which also indexes the document content. Serialization is done in JSON.
Storing = indexing (so an index is the equivalent of a database). Where relational databases add an index to specific columns, Elasticsearch uses a structure called an inverted index.
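To see what an inverted index is, here is a small Python sketch. It is a toy, not what Elasticsearch (or Lucene underneath) really does: there is no analysis beyond lowercasing and splitting on whitespace, and the function names build_inverted_index and search are mine.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of ids of the documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of the documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

With two documents, 1 = "I love to go rock climbing" and a made-up 2 = "I like rock music", searching for "rock" returns both ids and searching for "rock climbing" returns only document 1. The lookup goes from term to documents, which is the reverse of reading each document to find the term.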
field ⊂ document ⊂ index ⊂ cluster
Every field in a document is indexed. Mapping types have been deprecated, so we add a type field to signal that the document is of type employee.
POST /megacorp/_create/1
{
  "first_name": "John",
  "type": "employee",
  "last_name": "Smith",
  "age": 25,
  "about": "I love to go rock climbing",
  "interests": [ "sports", "music" ]
}
The concept of relevance is central to Elasticsearch.
Refresher on C++
Now that I am writing C++ again (the last time was at least 8 years ago), I am rediscovering long-forgotten syntax, such as:
the use of const keyword
what does a const argument mean
what is a const method in a class
when to pass an argument by reference
what is a virtual method of a class
how to override a virtual method in a child class
Constructors
Class members can be initialized with a member initializer list, whose syntax is the colon character (:) followed by a comma-separated list of one or more member initializers.
Nuggets of knowledge
C++ templates used at scale can slow down compile time (https://youtu.be/rX0ItVEVjHc?t=545)
JavaScript interpreters
I was looking around for a simple tutorial on building a JavaScript interpreter. Not only did I find a great example (https://github.com/MarcoLizza/tiny-js) but it led me to the Espruino project (code here https://github.com/espruino).
Espruino is a very light JS interpreter designed to work on microcontrollers.
Back to C++ with a new challenge
Thanks to a wonderful blog post from kipply’s blog, I’ve started to implement the Ray Tracing in One Weekend book (see here: https://raytracing.github.io/books/RayTracingInOneWeekend.html) and I am having a blast!
It’s been a long time
In the end I did not go on to work in data science, and technologies like Hadoop have gone out of fashion, but I feel like I have a million new things to learn.
So I’m reviving this blog.
Hadoop and the Amazon Cloud - Lesson 2 - Introducing cloud computing
Source : BigDataUniversity
Not a new concept
Gives the masses access to the computing power of a large number of servers
Illusion of an infinite number of servers
You rent the servers for the amount of time you are using them
You pay for the amount of CPU that you use
Advantages of cloud computing :
You don't need to set up a complicated and expensive IT system (capital expense becomes operating expense)
You do not need to plan for your peak CPU needs; you can scale on demand
You can set it up in a few hours instead of the month needed for an in-house server farm
Cloud computing service models
Infrastructure as a service (IaaS): you don't worry about the infrastructure, it is dealt with by the cloud service
Platform as a service (PaaS): you don't worry about the middle layer of infrastructure either
Software as a service (SaaS): you don't worry about the software, you pay a monthly subscription to use it
(Hardware as a service, Cloud as a service...)
Concerns of cloud computing
Security, privacy -> private cloud
Highly transactional workloads
Apps with complex regulations, resiliency
When to use the cloud
Proofs of concept
Development, tests
Apps with highly variable workloads
Hadoop Fundamentals 1 - Lesson 1 - Introduction to Hadoop
Source : BigDataUniversity, Hadoop Fundamentals 1
Lesson Summary
Use of Hadoop :
for large amounts of data in relational databases (~10 TB), for unstructured data (Facebook...)
Framework written in Java, MapReduce as foundation
Data structured or unstructured
Massively parallel processing, no immediate response -> batch
not possible to randomly access data; data is read sequentially
not a replacement for a relational DB
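Since MapReduce is the foundation, here is the classic word count written as a tiny single-machine Python sketch. Nothing here uses Hadoop's API: the names map_phase, shuffle, reduce_phase and word_count are mine, and the only goal is to show the three steps that Hadoop would spread over many machines.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word of one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Each call to map_phase only looks at one line and each call to reduce_phase only looks at one key, so both steps can run in parallel on separate chunks of data. That is also why the model is batch oriented: the result only exists once every chunk has been processed.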
Other products :
Lucene : text search engine library in Java
HBase : Hadoop DB
Hive : data warehousing, querying and loading data
Pig : high level language for analysing large datasets
Jaql : query language for JSON (JavaScript Object Notation)
ZooKeeper : configuration
Avro : data serialization system
UIMA : architecture for dealing with unstructured data
Limits of Hadoop. It is not good :
when there are lots of small datasets
for low-latency data access
for random access
when work cannot be parallelized
for intensive calculations with little data
Lab Summary
Comments and perspectives
Strata Conference
A very instructive overview of the current state of "data science": the Strata conference organized last January by the publisher O'Reilly.
http://bit.ly/rc1qOW
http://siliconangle.tv/channels/StrataConf