Topic: Extracting Blocktext tags (i.e. sections of filings)

This topic has 6 replies, 2 voices, and was last updated 2 years, 10 months ago by Satish Sahoo.


Author

Posts
- Tuesday, August 23, 2022 at 5:57 AM #202987
  
  Satish Sahoo
  Participant
  
  Hi David/Others,
  I want to extract certain sections of 10-K filings. For example, I tried to extract segment information using the ‘SegmentReportingDisclosureTextBlock’ tag. I was hoping to get the entire section (including the tables, text, etc.).
  While I am able to pull the tag, its fact.value just gives me the details of the tag.
  For example, I get:
  <p style='margin-top:0pt; margin-bottom:0pt'><font style="font-family:Times New Roman;font-size:10pt;font-weight:bold;margin-left:0px;">NOTE 4.</font><font style="font-family:Times New Roman;font-…
  
  <div style="font-family:Times New Roman;font-size:10pt;"><div style="line-height:120%;padding-top:18px;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">ACQUISITI…
  
  ACQUISITIONS, GOODWILL, AND ACQUIRED INTANGIBLE ASSETS<div style="line-height:120%;padding-top:6px;text-indent:16px;font-size:10pt;"><span style="font-family:inherit;font-size:10pt;font-style:ital…
  
  How can I get the entire section’s text and formatting information so I can put it in another HTML or text file? I tried footnote.* as well, without any luck.
  
  Your help would be much appreciated.
  
  Thanks
- Wednesday, August 24, 2022 at 3:01 PM #203004
  David Tauriello
  Keymaster
  Hi Satish – these facts are HTML encoded; there is no ‘plain text’ version – the data is in there, but might be under several HTML tags for formatting purposes. You have a couple of options:
  - use regex in your routine to remove tags after you’ve retrieved the data (something like <.*?> should leave you with plain text, which might be tough to read … maybe replace it with spaces, tabs or line breaks?)
  - concatenate the fact.id with this string to create a URL that renders the fact: CONCAT( https://csuite.xbrl.us/php/dispatch.php?Task=htmlExportFact&FactID= , xxxxxx ) – we’re using this approach in some of the spreadsheet templates posted in the XBRL Data Community
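Both options above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original reply: the helper names `strip_tags` and `fact_url` and the sample fragment are made up for the example, and the regex approach is only as robust as the suggestion it follows (it is not a full HTML parser).

```python
import html
import re

# Base URL quoted in the reply above; the fact id is appended to it.
FACT_URL = "https://csuite.xbrl.us/php/dispatch.php?Task=htmlExportFact&FactID="

# Non-greedy tag pattern from the reply; DOTALL lets a tag span line breaks.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_tags(fragment: str) -> str:
    """Replace each tag with a space, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", fragment)
    text = html.unescape(text)  # e.g. &amp; -> &
    return re.sub(r"\s+", " ", text).strip()


def fact_url(fact_id) -> str:
    """Build the browser-view URL for a fact (hypothetical helper name)."""
    return f"{FACT_URL}{fact_id}"


# Made-up fragment in the style of a TextBlock fact value.
sample = (
    '<p style="margin-top:0pt"><font style="font-weight:bold">NOTE 4.</font>'
    " Segment &amp; geographic data</p>"
)
print(strip_tags(sample))   # NOTE 4. Segment & geographic data
print(fact_url(221545926))
```

Replacing tags with a space rather than an empty string keeps words from running together where a tag was the only separator.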
- Thursday, August 25, 2022 at 4:18 AM #203027
  
  Satish Sahoo
  Participant
  
  Hi David,
  Thanks a lot for your response. This is very helpful.
  I just have another quick, related question. I see that, at least since Inline XBRL started, the section files are posted separately in the filings on the EDGAR website. Is it possible to point to the URLs of those files using the API? I am not sure if this is within the API framework. If it is, that would be great. Thanks
  - Friday, August 26, 2022 at 3:44 PM #203046
    
    David Tauriello
    Keymaster
    
    Hi Satish – thanks for your question. As part of the process to keep our Public Filings Database current, we make exact copies of the documents submitted to the SEC, FERC and other regulators that contain XBRL (as .xml instances or .html files that have inline XBRL in them). We do not copy the exhibit files (.htm but without XBRL), images, text files, etc. You can use the report.sec-url field to get the page on EDGAR where these additional files exist.
- Friday, August 26, 2022 at 2:55 PM #203044
  
  Satish Sahoo
  Participant
  
  Hi David,
  After some more digging, I could get to the files that contain the text for any particular TextBlock tag. But I realized that fact.value doesn’t contain the entire block; it seems to be truncated. Is there a size limit on the fact.value output? If so, is there a setting that makes fact.value contain the entire text block?
  
  As an example you can check the fact.value of the following fact id.
  https://csuite.xbrl.us/php/dispatch.php?Task=htmlExportFact&FactID=221545926
  
  Thanks
  - Friday, August 26, 2022 at 4:08 PM #203047
    
    David Tauriello
    Keymaster
    
    Hi Satish – there might be a character limit if you’re trying to get the HTML from a spreadsheet. This is why we use a hyperlink in the spreadsheet to the browser view of the fact when there’s a “<\” character combination.
    
    If you query with curl or python, or use an API testing tool, you should see all of the HTML (the data in the HTML we present is the same data in our database).
- Saturday, August 27, 2022 at 4:59 AM #203054
  
  Satish Sahoo
  Participant
  
  Thanks, David. I think you pointed me in the right direction. I guess the truncation happens when I write the JSON list output from the API into a pandas DataFrame. So the API is still producing the entire section; it’s just the output rendering that causes the truncation. Thanks
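One way to sidestep the rendering truncation described above is to write each fact.value straight to its own HTML file instead of reading it off a printed table. The sketch below uses only the standard library. The response shape (a "data" list of records keyed by "fact.id" and "fact.value") is an assumption for illustration, and `save_fact_html` is a hypothetical helper name. If you stay in pandas, the `display.max_colwidth` option controls how many characters of a cell are printed; the stored string itself is not shortened.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the API's JSON output (assumed shape, not a real response).
response_text = json.dumps(
    {"data": [{"fact.id": 1, "fact.value": "<div>" + "x" * 5000 + "</div>"}]}
)


def save_fact_html(records, out_dir):
    """Write each record's full fact.value to <out_dir>/fact_<id>.html."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for rec in records:
        path = out_dir / f"fact_{rec['fact.id']}.html"
        path.write_text(rec["fact.value"], encoding="utf-8")
        paths.append(path)
    return paths


records = json.loads(response_text)["data"]
paths = save_fact_html(records, tempfile.mkdtemp())

# The value in memory and the file on disk hold the same, untruncated text.
print(len(records[0]["fact.value"]), paths[0].stat().st_size)
```

Opening the saved file in a browser shows the block with its original formatting, which also covers the earlier goal of putting a section into a separate HTML file.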