
espinr · 6 years ago

The Streets of Gijón: from the Newspaper Archives to the Knowledge Graph

Read this article in Spanish

Mis Calles Gijón/Xixón is a web application that started as a personal project, born of my curiosity while walking the streets of my hometown. I have always wondered about the people, historical events and places that streets are named after, and about the motivations behind those names, which act as a kind of tribute that preserves the history of a place.

Technologically, Mis Calles is a Progressive Web Application, suitable for any mobile device, that uses the device's location capabilities to enrich the user experience. It was built with a series of complex Python scripts that automate the ingestion and processing of information from open data sources. This automation matters because a Mis Calles could be deployed for any city in the world with only minimal adjustments.

The development process had the following steps:

Step 1: The streets and their location

The City of Gijón, through its Transparency and Open Data Portal, publishes two datasets that served to: (1) identify the streets of Gijón: list of streets, type (e.g., avenue, road, park, street), unique names (no acute accents) and numerical identifiers; and (2) locate them on a map. The location of the streets is based on the dataset of streets and building numbers (portales), a table with thousands of coordinates, one for each numbered building on each street in the city, rather than a line or polygon per street.

Step 2: Generate the polygons on the map

After joining the street and building-number tables, another script built the lines and polygons (in KML and GeoJSON formats) that shape the streets, enabling their display on a map.

For this, each street's building numbers are split into odd and even. The odd ones are sorted in ascending order and the polygon outline starts by joining their coordinates. Once the script reaches the highest odd number, it continues with the highest even number and joins the even points in descending order until it reaches the lowest. Finally, the polygon is closed by linking back to the first point in the list.
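The ordering described above can be sketched in a few lines of Python. This is only an illustration, not the original script: the record layout `(number, lon, lat)`, the function names and the sample coordinates are all assumptions made for the example.

```python
import json


def street_polygon(portals):
    """Build a closed ring from (number, lon, lat) records of one street.

    Odd numbers are walked in ascending order, then even numbers in
    descending order, and the ring is closed on the first point.
    """
    odd = sorted((p for p in portals if p[0] % 2 == 1), key=lambda p: p[0])
    even = sorted((p for p in portals if p[0] % 2 == 0),
                  key=lambda p: p[0], reverse=True)
    ring = [[lon, lat] for _, lon, lat in odd + even]
    if ring:
        ring.append(ring[0])  # close the polygon on its first point
    return ring


def to_geojson_feature(name, ring):
    """Wrap a ring as a GeoJSON Feature (coordinates are [lon, lat])."""
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


# Hypothetical street with four numbered buildings on a unit square.
portals = [(1, 0.0, 0.0), (2, 0.0, 1.0), (3, 1.0, 0.0), (4, 1.0, 1.0)]
ring = street_polygon(portals)
feature = to_geojson_feature("EXAMPLE STREET", ring)
print(json.dumps(feature))
```

A street with fewer than three distinct points cannot form a valid polygon this way, so a real script would probably need to fall back to a line or a point in that case.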

This technique is precise for certain streets, but if a street has few registered building numbers it can produce polygons with curious shapes, as can be seen in Calle Mariano Pola, below.

Step 3: Description of streets I: Piñera method

Once the streets were uniquely identified, including their approximate location, they were described using a publication by El Comercio: The streets of Gijón, History of their names, a great work by Luis Miguel Piñera.

The portal offers open access to this publication but does not specify reuse conditions. After several formal requests to the City Council went unanswered (I am still waiting), I asked El Comercio directly. They quickly agreed and wanted to take part in the project.

In the early 2000s, nobody expected to do more with such a publication than read it as a PDF, so no effort was made beyond keeping the PDF file. With a scraping script, I automated the extraction of the main data of the work: the name of the person or event the street is named after, a description, the date of registration and previous names. The extraction was chaotic (e.g., changes in the expected structure, page breaks, figures, citations, etc.), and the result had to be reviewed almost entirely by hand.

Another Python script joined the resulting tables: the list of streets, geolocated and now including a description. When it could not find an exact match between names (e.g., comparing 'LOS CALEROS' and 'CALEROS, LOS'), the script prompted me to choose the right one.
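As a rough sketch of how such name matching could work (the original script is not shown, so the normalization rules, the list of articles and the function names below are assumptions), both names can be reduced to a common key and, failing an exact match, close candidates can be offered for a manual choice:

```python
import difflib
import unicodedata

# Articles that may be moved to the end after a comma (assumed list).
ARTICLES = {"EL", "LA", "LOS", "LAS"}


def normalize(name):
    """Uppercase, drop acute accents and undo 'NAME, ARTICLE' inversion."""
    decomposed = unicodedata.normalize("NFD", name.upper())
    no_acute = decomposed.replace("\u0301", "")  # combining acute accent only
    text = unicodedata.normalize("NFC", no_acute).strip()
    if "," in text:
        head, tail = (part.strip() for part in text.rsplit(",", 1))
        if tail in ARTICLES:
            text = f"{tail} {head}"  # 'CALEROS, LOS' -> 'LOS CALEROS'
    return " ".join(text.split())


def match(name, catalogue):
    """Return (exact_match, candidates) for a name against a list of names."""
    index = {normalize(entry): entry for entry in catalogue}
    key = normalize(name)
    if key in index:
        return index[key], []
    close = difflib.get_close_matches(key, list(index), n=3, cutoff=0.6)
    return None, [index[c] for c in close]  # candidates for a human to pick


streets = ["LOS CALEROS", "CABRALES"]
exact, _ = match("CALEROS, LOS", streets)
missing, candidates = match("CABRALE", streets)
```

Here `exact` resolves without human help, while `candidates` holds the suggestions a prompt would show when nothing matches exactly.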

Step 4: Description of streets II: Knowledge Graph

Although the real value of the application comes from Piñera's historical research, several streets are not covered by his work. These are usually named after concepts or contemporary people that are easily recognizable and whose descriptive information can be found on the Web. So, a second way of finding information about those entities was to search the Knowledge Graph, which Google offers through an API.

The Knowledge Graph is a structure of information linked by semantic concepts, which Google uses to show relevant search results. The graph contains information from open sources, such as Wikipedia, and includes metadata that characterizes each concept in depth.

This data source is really useful for getting information about places, events or people. The script queries the API and checks the types of the results returned. Each result also carries a confidence score. When a match of the expected type is found, the script stores the needed information in a table: description, alternative title, URL of the entity (Wikipedia page), image URL and associated license, and the entity type standardised with schema.org (e.g., CivicStructure or LandmarksOrHistoricalBuildings). For example, the result obtained for the entity 'EUROPA' [Square] is:

Name: Europa

Score (confidence): 244.048874

URL: https://es.wikipedia.org/wiki/Europa

Type: Continent, Thing, Place, AdministrativeArea

Description: Europe is one of the continents that make up the Eurasian supercontinent, located between the parallels 36º and 70º north latitude.

Image: http://t0.gstatic.com/images?q=tbn:ANd9GcS0uZNMs4cfxsd-XAvy8iNVY-6CjX9gMCUV2BSloYIjgEQewznu

Image license: https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License

Knowledge Graph URI: kg:/m/02j9z
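A minimal sketch of this query-and-filter step follows. It assumes the public Knowledge Graph Search API endpoint and its JSON-LD response layout (`itemListElement`, `result`, `resultScore`, `detailedDescription`); those field names should be checked against Google's reference, and the function names, the `min_score` threshold and the hand-written sample response are assumptions for illustration only. The sample reuses the values listed above, and no network call is made.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_search_url(query, api_key, languages="es", limit=5):
    """Compose the request URL; api_key is a placeholder you must supply."""
    params = {"query": query, "key": api_key,
              "languages": languages, "limit": limit}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def pick_entity(response, expected_types, min_score=0.0):
    """Return the first result whose types overlap the expected ones."""
    for element in response.get("itemListElement", []):
        result = element.get("result", {})
        types = result.get("@type", [])
        if isinstance(types, str):
            types = [types]
        score = element.get("resultScore", 0.0)
        if score >= min_score and set(types) & set(expected_types):
            detailed = result.get("detailedDescription", {})
            return {
                "name": result.get("name"),
                "score": score,
                "types": types,
                "description": detailed.get("articleBody"),
                "url": detailed.get("url"),
                "kg_id": result.get("@id"),
            }
    return None  # nothing of the expected type: leave it for manual review


# Hand-written sample shaped like an API response (structure assumed).
sample = {
    "itemListElement": [{
        "resultScore": 244.048874,
        "result": {
            "@id": "kg:/m/02j9z",
            "name": "Europa",
            "@type": ["Continent", "Thing", "Place", "AdministrativeArea"],
            "detailedDescription": {
                "articleBody": "Europe is one of the continents that make up "
                               "the Eurasian supercontinent.",
                "url": "https://es.wikipedia.org/wiki/Europa",
            },
        },
    }]
}

entity = pick_entity(sample, expected_types={"Place"})
url = build_search_url("EUROPA", api_key="YOUR_API_KEY")
```

Filtering on the expected type before storing anything keeps obviously wrong matches out of the table, although the score scale is not normalized, so any threshold would have to be tuned by hand.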

Unfortunately, Google has removed the dereferencing and visualization of Knowledge Graph entities; until a few weeks ago, an infobox with the information related to the entity could be displayed. In any case, this does not affect the application.

Step 5: Types of Entities

The Knowledge Graph and the queries against it were the basis for classifying the entities in an automated way. In the previous example, Europe has the types Thing, Continent, Place and AdministrativeArea.

Approximately 40% of the entities were classified automatically, always under human supervision, since there are confusing cases in which artificial intelligence cannot help.

For example, Calle 'Cabrales' refers to the concept of 'Cabranes'. Within the Knowledge Graph, both concepts are identified and described as municipalities in Asturias, but the street is actually named after a person.
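The idea of proposing a theme from the schema.org types, while leaving the final word to a person, might look like the sketch below. The mapping table is hypothetical (the article does not publish the rules actually used) and only covers a handful of types.

```python
# Hypothetical mapping from schema.org types to themes; not the original rules.
TYPE_TO_THEME = {
    "Continent": "Geography",
    "Country": "Geography",
    "AdministrativeArea": "Geography",
    "Organization": "Organisations",
    "Corporation": "Organisations",
}


def suggest_theme(types):
    """Return a theme only when the specific types agree on exactly one.

    Generic types such as Thing or Place are not in the table and are
    ignored. A None result means a person has to classify the entity.
    """
    themes = {TYPE_TO_THEME[t] for t in types if t in TYPE_TO_THEME}
    if len(themes) == 1:
        return themes.pop()
    return None


suggestion = suggest_theme(["Continent", "Thing", "Place", "AdministrativeArea"])
unresolved = suggest_theme(["Thing", "Person"])
```

Even a confident suggestion is only a proposal: in the Cabrales case the types would point to Geography, which is exactly the kind of error the human review has to catch.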

Once all the tables were joined (listing, geo-positioning, descriptions from the Piñera method and the Knowledge Graph, and types), I completed the classification with a semi-automatic homogenization of the types, using OpenRefine.

Finally, all the themes were organized into an intuitive taxonomy through a card sorting exercise. As a result, the 1035 entities were grouped as:

(58) Artists (musicians, painters, actors, athletes)

(53) Concepts (feelings, jobs, physical elements)

(11) Explorers (historic conquerors)

(220) Geography (municipalities of Asturias, geographical features, populated places, etc.)

(113) History (historical events, people, landmarks of historical relevance)

(69) Business (people related to industrial and/or urban development)

(112) Letters (writers, thinkers, philosophers, etc.)

(105) Nature (fauna, flora, and other natural concepts)

(21) Nobility (monarchs and nobility in general)

(27) Organisations (companies and organizational entities)

(94) Politicians (governors, military and councilors)

(79) Religion (religious people, places and concepts)

(73) Scientists (mathematicians, physicists and science or technology persons)

Also, taking advantage of the more specific types gathered by the scripts, I produced a second level of themes. For example, the theme Geography includes the following sub-themes: Country (27 entities), Geography of Asturias (66), Geography of Spain (44), Geography of Gijón (21), International Geography (13), Historical Places (1), Orography (48).

Have a look at the result of Mis Calles Gijón/Xixón, play with it and send me your comments.

#opendata #knowledge graph #tourism #history #local #people #streets #kg #gijon #xixon #webapp #semantic web


Mis Calles Gijón/Xixón: from Piñera to the Knowledge Graph

Read this article in English.

The Mis Calles Gijón/Xixón application grew out of a personal project to support the curiosity that the street map awakens in many of us: the people and historical events that give cities their names and preserve their history through these tributes offered by citizens and governments.

Technologically, Mis Calles is a Progressive Web Application, suitable for any mobile device, that uses the device's location capabilities to enrich the user experience. The application was developed with a series of complex scripts written in Python, which automated the ingestion and processing of data from open data sources. This is interesting because we could deploy a Mis Calles in any city in the world with minimal adjustments.

The construction process was as follows:

Step 1: The streets and their location

The City Council of Gijón, through its Transparency and Open Data Portal, publishes two datasets that I used to: (1) identify the streets of Gijón: list of streets, type of road (avenue, path, street, etc.), unique names (without accents) and numerical identifiers; and (2) locate them on a map. The streets are located using the dataset of streets and building numbers, a table with thousands of coordinates, one for each building entrance on each street in the city.

Step 2: Generate the polygons on the map

After joining the table of streets with the table of building numbers, another script was written to build the lines and polygons (in KML and GeoJSON format) that form the streets and that later make it possible to determine which street we are on, as well as to display it on a map.

For this, each street's building numbers are split into even and odd. The odd ones are sorted in ascending order and a polygon is generated by joining all the points given by their coordinates; on reaching the highest odd number, it is joined to the coordinates of the highest even number, and then all the even numbers are joined in descending order until the lowest is reached. Finally, the polygon is closed with the first of the points.

This technique is precise on certain streets, but if a street has few registered building numbers it can produce polygons with curious shapes, as can be seen in Calle Mariano Pola, below.

Step 3: Description of the streets I: the Piñera method

Once the streets were uniquely identified, including their approximate location, they were given descriptions using a publication edited by El Comercio and openly distributed by the City Council itself: Las calles de Gijón, Historia de sus nombres, a great work of historical compilation by Luis Miguel Piñera.

The portal grants the right of access but does not specify reuse conditions, so after repeatedly and formally requesting permission from the City Council and receiving no response (I am still waiting), the request was made to El Comercio, who were delighted to agree.

In the early 2000s nothing more would have been expected than reading the PDF, so no effort was made beyond preserving the PDF file. A script automated the extraction of the main data of the work: title of the entity that gives the street its name, description, date of registration and previous names. The extraction produced errors due to changes in the expected structure, such as page breaks. A manual review was needed to avoid out-of-place text, such as photo captions, references or page numbers interleaved in the descriptions.

Another script joined the different tables: list of streets, geolocated and now with a description. When no exact match was found between the names of the entities (e.g., comparing 'LOS CALEROS' and 'CALEROS, LOS'), potential matches were offered to the human operator.

Step 4: Description of the streets II: Knowledge Graph

Although the real value of the application comes from the historical research in Piñera's publication, there are certain streets whose entities do not appear in it. These tend to be concepts or contemporary people that are easily recognizable and whose descriptive information can be found on the Web. So another method of finding information associated with the entities was to search the Knowledge Graph, which Google offers through its API.

The Knowledge Graph is a structure of information linked by semantic concepts, which Google uses to show results suited to users' needs after a search. This graph mainly contains information from open sources, such as Wikipedia, and includes metadata associated with the concepts that characterizes them in detail.

This data source is really interesting for obtaining information about places, events or people. The script that queries the API distinguishes between the types of results obtained. The results also show the degree of confidence in the match, so when a match of the expected type is found, it records the data needed by the application: description, alternative title, URL of the entity (Wikipedia page), URL of the related image and associated license, and entity type normalized using schema.org (e.g., CivicStructure or LandmarksOrHistoricalBuildings). For example, the result obtained for the entity [Plaza de] 'EUROPA' is:

Name: Europa

Score (confidence): 244.048874

URL: https://es.wikipedia.org/wiki/Europa

Type: Continent, Thing, Place, AdministrativeArea

Description: Europe is one of the continents that make up the Eurasian supercontinent, located between the parallels 36º and 70º north latitude.

Image: http://t0.gstatic.com/images?q=tbn:ANd9GcS0uZNMs4cfxsd-XAvy8iNVY-6CjX9gMCUV2BSloYIjgEQewznu

Image license: https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License

Knowledge Graph URI: kg:/m/02j9z

Unfortunately, Google has removed the resolution and visualization of Knowledge Graph entities (a few weeks ago an infobox with the information about the entity could be displayed), but this does not affect the application.

Step 5: Types of Entities

The Knowledge Graph and the queries against it were the starting point for classifying the entities in an automated way. In the previous example, Europa is of type: Thing (of course), Continent, Place and AdministrativeArea.

Approximately 40% of the entities represented were classified automatically, always under human supervision, since there are confusing cases in which artificial intelligence cannot do anything.

For example, Calle 'Cabrales' refers to the concept 'Cabranes'. Within the Knowledge Graph, both concepts are identified and described as Asturian municipalities, but the street is actually named after a person.

Once all the tables were joined (listing, geoposition, descriptions from the Piñera method and the Knowledge Graph, and types), the types were completed and then homogenized semi-automatically. OpenRefine was used for this.

Finally, a card sorting exercise was carried out to group concepts by theme and define an intuitive taxonomy for classifying the entities. As a result, the 1035 entities obtained are grouped into:

(58) Artists (musicians, painters, actors and actresses, athletes)

(53) Concepts (feelings, trades, other physical elements)

(11) Explorers (navigators, conquerors, etc.)

(220) Geography (Asturian municipalities, geographical features, populated places, etc.)

(113) History (historical events, people, places and sites of historical relevance)

(69) Industrialists (people related to industrial or urban development)

(112) Letters (writers, thinkers, philosophers, etc.)

(105) Nature (fauna, flora and other natural concepts)

(21) Nobility (monarchs and nobility in general)

(27) Organisations (companies or organizational entities)

(94) Politicians (rulers, military figures and public administrators)

(79) Religion (people or places related to religion)

(73) Scientists (mathematicians, physicists or people related to science or technology)

Likewise, taking advantage of the more specific types available, a second level was established in the taxonomy of themes. For example, the theme Geography includes the subthemes: Country (27 entities), Geography of Asturias (66), Geography of Spain (44), Geography of Gijón (21), International Geography (13), Historical Places (1), Orography (48).

You can take a look at the result of Mis Calles Gijón/Xixón, play with it and send me your comments.

#miscalles #gijón #xixón #desarrollo #python #knowledge graph #kg
